Preface


Machine learning is a name that is gaining popularity as an umbrella and evolution for methods that
have been studied and developed for many decades in different scientific communities and under different
names, such as statistical learning, statistical signal processing, pattern recognition, adaptive signal
processing, image processing and analysis, system identification and control, data mining and information
retrieval, computer vision, and computational learning. The name “machine learning” indicates
what all these disciplines have in common, that is, to learn from data, and then make predictions. What
one tries to learn from data is their underlying structure and regularities, via the development of a 
model, which can then be used to provide predictions. 

To this end, a number of diverse approaches have been developed, ranging from optimization of cost 
functions, whose goal is to minimize the deviation between what one observes from data and what the
model predicts, to probabilistic models that attempt to model the statistical properties of the observed 
data. 

The goal of this book is to approach the machine learning discipline in a unifying context, by pre- 
senting major paths and approaches that have been followed over the years, without giving preference 
to a specific one. It is the author’s belief that all of them are valuable to the newcomer who wants to 
learn the secrets of this topic, from the applications as well as from the pedagogic point of view. As the 
title of the book indicates, the emphasis is on the processing and analysis front of machine learning and 
not on topics concerning the theory of learning itself and related performance bounds. In other words, 
the focus is on methods and algorithms closer to the application level. 

The book is the outgrowth of more than three decades of the author’s experience in research and 
teaching various related courses. The book is written in such a way that individual (or pairs of) chapters 
are as self-contained as possible. So, one can select and combine chapters according to the focus he/she 
wants to give to the course he/she teaches, or to the topics he/she wants to grasp in a first reading. Some 
guidelines on how one can use the book for different courses are provided in the introductory chapter. 

Each chapter grows by starting from the basics and evolving to embrace more recent advances.
Some of the topics had to be split into two chapters, such as sparsity-aware learning, Bayesian learning, 
probabilistic graphical models, and Monte Carlo methods. The book addresses the needs of advanced 
graduate, postgraduate, and research students as well as of practicing scientists and engineers whose 
interests lie beyond black-box approaches. Also, the book can serve the needs of short courses on specific
topics, e.g., sparse modeling, Bayesian learning, probabilistic graphical models, neural networks
and deep learning. 

Second Edition 

The first edition of the book, published in 2015, covered advances in the machine learning area up to 
2013-2014. These years coincide with the start of a real boom in research activity in the field of deep
learning that really reshaped our related knowledge and revolutionized the field of machine learning. 
The main emphasis of the current edition was to, basically, rewrite Chapter 18. The chapter now covers 
a review of the field, starting from the early days of the perceptron and the perceptron rule, until 
the most recent advances, including convolutional neural networks (CNNs), recurrent neural networks 
(RNNs), adversarial examples, generative adversarial networks (GANs), and capsule networks. 
Also, the second edition covers in a more extended and detailed way nonparametric Bayesian methods,
such as Chinese restaurant processes (CRPs) and Indian buffet processes (IBPs). It is the author’s
belief that Bayesian methods will gain in importance in the years to come. Of course, only time can
tell whether this will happen or not. However, the author’s feeling is that uncertainty is going to be
a major part of the future models and Bayesian techniques can be, at least in principle, a reasonable 
start. Concerning the other chapters, besides the (omnipresent!) typos that have been corrected, changes 
have been included here and there to make the text easier to read, thanks to suggestions by students, 
colleagues, and reviewers; I am deeply indebted to all of them. 

Most of the chapters include MATLAB® exercises, and the related code is freely available from 
the book’s companion website. Furthermore, in the second edition, all the computer exercises are also 
given in Python together with the corresponding code, which is also freely available via the website
of the book. Finally, some of the computer exercises in Chapter 18 that are related to deep learning,
and which are closer to practical applications, are given in TensorFlow.

The solutions manual as well as lecture slides are available from the book’s website for instructors.

In the second edition, all appendices have been moved to the website associated with the book, and 
they are freely downloadable. This was done in an effort to save space in a book that is already more 
than 1100 pages. Also, some sections dedicated to methods that were present in various chapters in the 
first edition, which I felt do not constitute basic knowledge and current mainstream research topics, 
while they were new and “fashionable” in 2015, have been moved, and they can be downloaded from 
the companion website of the book. 

Instructor site URL: 

http://textbooks.elsevier.com/web/Manuals.aspx?isbn=9780128188033 

Companion Site URL: 

https://www.elsevier.com/books-and-journals/book-companion/9780128188033 
Notation


I have made an effort to keep a consistent mathematical notation throughout the book. Although every
symbol is defined in the text prior to its use, it may be convenient for the reader to have the list of major
symbols summarized together. The list is presented below: 

• Vectors are denoted with boldface letters, such as x. 

• Matrices are denoted with capital letters, such as A. 

• The determinant of a matrix is denoted as det{A}, and sometimes as |A|.

• A diagonal matrix with elements $a_1, a_2, \ldots, a_l$ in its diagonal is denoted as $A = \mathrm{diag}\{a_1, a_2, \ldots, a_l\}$.

• The identity matrix is denoted as I. 

• The trace of a matrix is denoted as trace{A}.

• Random variables are denoted with roman fonts, such as $\mathrm{x}$, and their corresponding values with
mathmode letters, such as $x$.

• Similarly, random vectors are denoted with roman boldface, such as $\mathbf{x}$, and the corresponding values
as $\boldsymbol{x}$. The same is true for random matrices, denoted as $\mathrm{X}$ and their values as $X$.

• Probability values for discrete random variables are denoted by capital P, and probability density
functions (PDFs), for continuous random variables, are denoted by lower case p.

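• The vectors are assumed to be column-vectors. In other words,

$$
\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_l \end{bmatrix}, \quad \text{or} \quad \boldsymbol{x} = \begin{bmatrix} x(1) \\ x(2) \\ \vdots \\ x(l) \end{bmatrix}.
$$

That is, the ith element of a vector can be represented either with a subscript, $x_i$, or as $x(i)$.

• Matrices are written as

$$
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1l} \\ \vdots & \vdots & \ddots & \vdots \\ x_{l1} & x_{l2} & \cdots & x_{ll} \end{bmatrix}, \quad \text{or} \quad X = \begin{bmatrix} X(1,1) & X(1,2) & \cdots & X(1,l) \\ \vdots & \vdots & \ddots & \vdots \\ X(l,1) & X(l,2) & \cdots & X(l,l) \end{bmatrix}.
$$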


• Transposition of a vector is denoted as $\boldsymbol{x}^T$ and the Hermitian transposition as $\boldsymbol{x}^H$.

• Complex conjugation of a complex number is denoted as $x^*$ and also $\sqrt{-1} := j$. The symbol “:=”
denotes definition.

• The sets of real, complex, integer, and natural numbers are denoted as $\mathbb{R}$, $\mathbb{C}$, $\mathbb{Z}$, and $\mathbb{N}$, respectively.

• Sequences of numbers (vectors) are denoted as $x_n$ ($\boldsymbol{x}_n$) or $x(n)$ ($\boldsymbol{x}(n)$) depending on the context.

• Functions are denoted with lower case letters, e.g., f, or in terms of their arguments, e.g., f(x) or
sometimes as f(·), if no specific argument is used, to indicate a function of a single argument, or
f(·,·) for a function of two arguments and so on.
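
As an informal aid, the notation listed above can be mirrored in code. The following is a minimal sketch, assuming Python with NumPy; the helper name `element` and all sample values are hypothetical and are not taken from the book's companion code. Note that the notation indexes vector elements starting from 1, whereas NumPy arrays start from 0.

```python
import numpy as np

# A column vector x with l = 3 elements, stored as an (l, 1) array.
x = np.array([[1.0], [2.0], [3.0]])
l = x.shape[0]


def element(vec, i):
    """Return the ith element (1-indexed, as in x_i or x(i)) of a column vector."""
    return vec[i - 1, 0]


# Transposition x^T turns the column vector into a row vector.
x_T = x.T

# Hermitian transposition x^H: transposition combined with complex conjugation.
z = np.array([[1 + 2j], [3 - 1j]])
z_H = z.conj().T

# A = diag{a_1, ..., a_l}, the identity matrix I, det{A}, and trace{A}.
a = [2.0, 3.0, 4.0]
A = np.diag(a)
I = np.eye(l)
det_A = np.linalg.det(A)
trace_A = np.trace(A)

print(x_T.shape, z_H, det_A, trace_A)
```

Storing vectors as two-dimensional (l, 1) arrays is only one possible convention; it is chosen here because it makes the column-vector assumption explicit.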
1.1 THE HISTORICAL CONTEXT

During the period that covers, roughly, the last 250 years, humankind has lived and experienced three 
transforming revolutions, which have been powered by technology and science. The first industrial
revolution was based on the use of water and steam and its origins are traced to the end of the 18th 
century, when the first organized factories appeared in England. The second industrial revolution was 
powered by the use of electricity and mass production, and its “birth” is traced back to around the turn 
of the 20th century. The third industrial revolution was fueled by the use of electronics, information 
technology, and the adoption of automation in production. Its origins coincide with the end of the 
Second World War. 

Although it is difficult for humans, including historians, to put a stamp on the age in which they themselves
live, more and more people are claiming that the fourth industrial revolution has already started
and is fast transforming everything that we know and learned to live with so far. The fourth industrial
revolution builds upon the third one and is powered by the fusion of a number of technologies, e.g.,
computers and communications (internet), and it is characterized by the convergence of the physical,
digital, and biological spheres.

The terms artificial intelligence (AI) and machine learning are used and spread more and more to 
denote the type of automation technology that is used in the production (industry), in the distribution of 
goods (commerce), in the service sector, and in our economic transactions (e.g., banking). Moreover,
these technologies affect and shape the way we socialize and interact as humans via social networks, 
and the way we entertain ourselves, involving games and cultural products such as music and movies. 

A distinct qualitative difference of the fourth, compared to the previous industrial revolutions, is
that, before, it was the manual skills of humans that were gradually replaced by “machines.” In the
one that we are currently experiencing, mental skills are also replaced by “machines.” We now have
automatic answering software that runs on computers, fewer people are serving us in banks, and many
jobs in the service sector have been taken over by computers and related software platforms. Soon, we
are going to have cars without drivers and drones for deliveries. At the same time, new jobs, needs, 
and opportunities appear and are created. The labor market is fast changing and new competences and 
skills are and will be required in the future (see, e.g., [22,23]). 

At the center of this historical happening, as one of the key enabling technologies, lies a discipline 
that deals with data and whose goal is to extract information and related knowledge that is hidden in 
it, in order to make predictions and, subsequently, take decisions. That is, the goal of this discipline is 
to learn from data. This is analogous to what humans do in order to reach decisions. Learning through
the senses, personal experience, and the knowledge that propagates from generation to generation is 
at the heart of human intelligence. Also, at the center of any scientific field lies the development of 
models (often called theories) in order to explain the available experimental evidence. In other words, 
data comprise a major source of learning. 


1.2 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING 

The title of the book refers to machine learning, although the term artificial intelligence is used more 
and more, especially in the media but also by some experts, to refer to any type of algorithms and 
methods that perform tasks that traditionally required human intelligence. Being aware that definitions
of terms can never be exact and there is always some “vagueness” around their respective meanings,
I will still attempt to clarify what I mean by machine learning and in which aspects this term means
something different from AI. No doubt, there may be different views on this. 

Although the term machine learning was popularized fairly recently, as a scientific field it is an 
old one, whose roots go back to statistics, computer science, information theory, signal processing,
and automatic control. Examples of some related names from the past are statistical learning, pattern
recognition, adaptive signal processing, system identification, image analysis, and speech recognition.
What all these disciplines have in common is that they process data, develop models that are data-adaptive,
and subsequently make predictions that can lead to decisions. Most of the basic theories and
algorithmic tools that are used today had already been developed and known before the dawn of this
century. There was, however, a “small” yet important difference: the available data, as well as the computer power
prior to 2000, were not enough to use some of the more elaborate and complex models that had been
developed. The terrain started changing after 2000, in particular around 2010. Large data sets were
gradually created and the computer power became affordable to allow the use of more complex models.
In turn, more and more applications adopted such algorithmic techniques. “Learning from data”
became the new trend and the term machine learning prevailed as an umbrella for such techniques. 

Moreover, the big difference was made with the use and “rediscovery” of what is today known 
as deep neural networks. These models offered impressive predictive accuracies that had never been 
achieved by previous models. In turn, these successes paved the way for the adoption of such models in 
a wide range of applications and also ignited intense research, and new versions and models have been 
proposed. These days, another term that is catching up is “data science,” indicating the emphasis on
how one can develop robust machine learning and computational techniques that deal efficiently with 
large-scale data. 

However, the main rationale, which runs through the spine of all the methods that come under the machine
learning umbrella, remains the same and it has been around for many decades. The main concept is
to estimate a set of parameters that describe the model, using the available data and, in the sequel,
to make predictions based on low-level information and signals. One may easily argue that there is
not much intelligence built in such approaches. No doubt, deep neural networks involve much more 
“intelligence” than their predecessors. They have the potential to optimize the representation of their 
low-level input information to the computer. 
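
The estimate-then-predict rationale described above can be illustrated with a toy example. What follows is only a minimal sketch, assuming Python with NumPy, a straight-line model, and a least-squares criterion; the data, the model, and the function name `predict` are hypothetical choices made for illustration and do not correspond to any specific method treated later in the book.

```python
import numpy as np

# Hypothetical training data generated by y = 2x + 1 (noise-free, for clarity).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = 2.0 * x_train + 1.0

# Model: y = theta_0 + theta_1 * x. Each row of X is [1, x_n].
X = np.column_stack([np.ones_like(x_train), x_train])

# Step 1: estimate the parameters from the available data (least squares).
theta, *_ = np.linalg.lstsq(X, y_train, rcond=None)


def predict(x_new, theta):
    """Step 2: use the estimated parameters to predict the output for new inputs."""
    return theta[0] + theta[1] * np.asarray(x_new)


y_pred = predict([4.0], theta)
print(theta, y_pred)
```

The two steps are deliberately kept separate: the parameters are obtained once from the training data and are then reused for every new input.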

The term “representation” refers to the way in which related information that is hidden in the input 
data is quantified/coded so that it can be subsequently processed by a computer. In the more technical 
jargon, each piece of such information is known as a feature (see also Section 1.5.1). As discussed in 
detail in Chapter 18, where neural networks (NNs) are defined and presented in detail, what makes these 
models distinctly different from other data learning methods is their multilayer structure. This allows 
for the “building” up of a hierarchy of representations of the input information at various abstraction 
levels. Every layer builds upon the previous one and the higher in hierarchy, the more abstract the 
obtained representation is. This structure offers to neural networks a significant performance advantage 
over alternative models, which restrict themselves to a single representation layer. Furthermore, this 
single-level representation was rather hand-crafted and designed by the users, in contrast to the deep 
networks that “learn” the representation layers from the input data via the use of optimality criteria. 
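
To make the idea of stacked representation layers more concrete, the following minimal sketch, assuming Python with NumPy, passes an input vector through two layers and keeps the output of each one. The layer sizes, the ReLU nonlinearity, and the random (untrained) weights are assumptions made purely for illustration; in practice the weights would be learned from data, as discussed in Chapter 18.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(v):
    """Elementwise nonlinearity: max(v, 0)."""
    return np.maximum(v, 0.0)


def forward(x, layers):
    """Pass x through each layer in turn and return every intermediate representation."""
    representations = [x]
    for W, b in layers:
        x = relu(W @ x + b)  # each layer builds upon the output of the previous one
        representations.append(x)
    return representations


# Hypothetical layer sizes: 8 input features -> 5 -> 3.
sizes = [8, 5, 3]
layers = [
    (rng.standard_normal((n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.standard_normal(8)
reps = forward(x, layers)
print([r.shape for r in reps])
```

Here `reps[0]` is the raw input, while `reps[1]` and `reps[2]` play the role of successively more abstract representations of it.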

Yet, in spite of the previously stated successes, I share the view that we are still very far from what
an intelligent machine should be. For example, once trained (estimating the parameters) on one data
set, which has been developed for a specific task, it is not easy for such models to generalize to other
tasks. Although, as we are going to see in Chapter 18, advances in this direction have been made, we are
still very far from what human intelligence can achieve. When a child sees one cat, it readily recognizes
another one, even if this other cat has a different color or if it turns around. Current machine learning 
systems need thousands of images with cats, in order to be trained to “recognize” one in an image. If a 
human learns to ride a bike, it is very easy to transfer this knowledge and learn to ride a motorbike or 
even to drive a car. Humans can easily transfer knowledge from one task to another, without forgetting 
the previous one. In contrast, current machine learning systems lack such a generalization power and 
tend to forget the previous task once they are trained to learn a new one. This is also an open field of 
research, where advances have also been reported. 

Furthermore, machine learning systems that employ deep networks can even achieve superhuman 
prediction accuracies on data similar to those with which they have been trained. This is a significant 
achievement, not to be underestimated, since such techniques can efficiently be used for dedicated 
jobs; for example, to recognize faces, to recognize the presence of various objects in photographs, and 
also to annotate images and produce text that is related to the content of the image. They can recognize 
speech, translate text from one language to another, detect which music piece is currently playing in the
bar, and whether the piece belongs to the jazz or to the rock musical genre. At the same time, they can 
be fooled by carefully constructed examples, known as adversarial examples, in a way that no human 
would be fooled to produce a wrong prediction (see Chapter 18). 

Concerning AI, the term “artificial intelligence” was first coined by John McCarthy in 1956 when 
he organized the first dedicated conference (see, e.g., [20] for a short history). The concept at that time, 
which still remains a goal, was whether one can build an intelligent machine, realized on software and
hardware, that can possess human-like intelligence. In contrast to the field of machine learning, the
concept for AI was not to focus on low-level information processing with emphasis on predictions, but
on the high-level cognitive capabilities of humans to reason and think. No doubt, we are still very far
from this original goal. Predictions are, indeed, part of intelligence. Yet, intelligence is much more than
that. Predictions are associated with what we call inductive reasoning. Yet what really differentiates
human from animal intelligence is the power of the human mind to form concepts and create
conjectures for explaining data and, more generally, the world in which we live. Explanations comprise
a high-level facet of our intelligence and constitute the basis for scientific theories and the creation of
our civilization. They are assertions concerning the “why”s and the “how”s related to a task, e.g.,
[5,6,11]. 

To talk about AI, at least as it was conceived by pioneers such as Alan Turing [16], systems should 
have built-in capabilities for reasoning and giving meaning, e.g., in language processing, to be able 
to infer causality, to model efficient representations of uncertainty, and, also, to pursue long-term 
goals [8]. Possibly, towards achieving these challenging goals, we may have to understand and imple- 
ment notions from the theory of mind, and also build machines that implement self-awareness. The 
former psychological term refers to the understanding that others have their own beliefs and intentions 
that justify their decisions. The latter refers to what we call consciousness. As a last point, recall that 
human intelligence is closely related to feelings and emotions. As a matter of fact, the latter seem 
to play an important part in the creative mental power of humans (e.g., [3,4,17]). Thus, in this more
theoretical perspective AI still remains a vision for the future.

The previous discussion should not be taken as an attempt to get involved with philosophical theories
concerning the nature of human intelligence and AI. These topics comprise a field in itself, for
more than 60 years, which is much beyond the scope of this book. My aim was to make the newcomer 
in the field aware of some views and concerns that are currently being discussed. 

In the more practical front, for the early years, the term AI was used to refer to techniques built 
around knowledge-based systems that sought to hard-code knowledge in terms of formal languages, 
e.g., [13]. Computer “reasoning” was implemented via a set of logical inference rules. In spite of the 
early successes, such methods seem to have reached a limit, see, e.g., [7]. It was the alternative path of 
machine learning, via learning from data, that gave a real push into the field. These days, the term AI 
is used as an umbrella to cover all methods and algorithmic approaches that are related to the machine 
intelligence discipline, with machine learning and knowledge-based techniques being parts of it. 


1.3 ALGORITHMS CAN LEARN WHAT IS HIDDEN IN THE DATA 

It has already been emphasized that data lie at the heart of machine learning systems. Data are the 
beginning. It is the information hidden in the data, in the form of underlying regularities, correlations,
or structure, which a machine learning system tries to “learn.” Thus, irrespective of how intelligent a
software algorithm is designed to be, it cannot learn more than what the data which it has been trained
on allow.

Collecting the data and building the data set on which an “intelligent” system is going to be trained 
is highly critical. Building data sets that address human needs and developing systems that are going 
to make decisions on issues, where humans and their lives are involved, requires special attention, and 
above all, responsibility. This is not an easy issue and good intentions are not enough. We are all products
of the societies in which we live, which means that our beliefs, to a large extent, are formed by the
prevailing social stereotypes concerning, e.g., gender, racial, ethnic, religious, cultural, class-related, 
and political views. Most importantly, most of these beliefs take place and exist at a subconscious level. 
Thus, sampling “typical” cases to form data sets may have a strong flavor of subjectivity and introduce 
biases. A system trained on such data can affect lives, and it may take time for this to be found out. 
Furthermore, our world is fast changing and these changes should continuously be reflected in the systems
that make decisions on our behalf. Outsourcing our lives to computers should be done cautiously
and above all in an ethical framework, which is much wider and more general than the set of the existing
legal rules. 

Of course, although this puts a burden on the shoulders of the individuals, governments, and com- 
panies that develop data sets and “intelligent” systems, it cannot be left to their good will. On the one 
hand, a specialized legal framework that guides the design, implementation, and use of such platforms 
and systems is required to protect our ethical standards and social values. Of course, this does not con- 
cern only the data that are collected but also the overall system that is built. Any system that replaces 
humans should be (a) transparent, (b) fair, and (c) accurate. Not that the humans act, necessarily, ac- 
cording to what the previous three terms mean. However, humans can, also, reason and discuss, we 
have feelings and emotions, and we do not just perform predictions. 

On the other hand, this may be the time when we can develop and build more “objective” systems; 
that is, to go beyond human subjectivity. However, such “objectivity” should be based on science,
rules, criteria, and principles, which are not yet here. As Michael Jordan [8] puts it, the development
of such systems will require perspectives from the social sciences and humanities. Currently, such
systems are built following an ad hoc rather than a principled way. Karl Popper [12], one of the most
influential philosophers of science, stressed that all knowledge creation is theory-laden. Observations
are never free of an underlying theory or explanation. Even if one believes that the process begins with
observations, the act of observing requires a point of view (see, also, [1,6]).

If I take the liberty to indulge in a bit of science fiction (something trendy these days), when AI,
in the context of its original conception, is realized, then data sampling and creation of data sets for
training could be taken care of by specialized algorithms. Maybe such algorithms will be based on
scientific principles that in the meantime will have been developed. After all, this may be the time
of dawn for the emergence of a new scientific/engineering field that integrates in a principled way
data-focused disciplines. To this end, another statement of Karl Popper may have to be implemented,
i.e., that of falsification, yet in a slightly abused interpretation. An emphasis on building intelligent
systems should be directed on criticism and experimentation for finding evidence that refutes the
principles, which were employed for their development. Systems can only be used if they survive the
falsification test. 
1.4 TYPICAL APPLICATIONS OF MACHINE LEARNING

It is hard to find a discipline in which machine learning techniques for “learning” from data have not 
been applied. Yet, there are some areas, which can be considered as typical applications, maybe due to 
their economic and social impact. Examples of such applications are summarized below. 

SPEECH RECOGNITION 

Speech is the primary means of communication among humans. Language and speech comprise major 
attributes that differentiate humans from animals. Speech recognition has been one of the main research 
topics whose roots date back to the early 1960s. The goal of speech recognition is to develop methods 
and algorithms that enable the recognition and the subsequent representation in a computer of spoken 
language. This is an interdisciplinary field involving signal processing, machine learning, linguistics, 
and computer science.

Examples of speech recognition tasks and related systems that have been developed over the years 
range from the simplest isolated word recognition, where the speaker has to wait between utterances, 
to the more advanced continuous speech recognizers. In the latter, the user can speak almost naturally, 
and concurrently the computer can determine the content. Speaker recognition is another topic, where 
the system can identify the speaker. Such systems are used, for example, for security purposes. 

Speech recognition embraces a wide spectrum of applications. Some typical cases where speech 
recognizers have been used include automatic call processing in telephone networks, query-based in- 
formation systems, data entry, voice dictation, robotics, as well as assistive technologies for people 
with special needs, e.g., blind people. 

COMPUTER VISION 

This is a discipline that has been inspired by the human visual system. Typical tasks that are addressed 
within the computer vision community include the automatic extraction of edges from images, representation
of objects as compositions of smaller structures, object detection and recognition, optical
flow, motion estimation, inference of shape from various cues, such as shading and texture, and three- 
dimensional reconstruction of scenes from multiple images. Image morphing, that is, changing one 
image to another through a seamless transition, and image stitching, i.e., creating a panoramic image 
from a number of images, are also topics in the computer vision research. More recently, there is more 
and more interaction between the field of computer vision and that of graphics. 

MULTIMODAL DATA 

Both speech recognition and computer vision process information that originates from single modali- 
ties. However, humans perceive the natural world in a multimodal way, via their multiple senses, e.g., 
vision, hearing, and touch. There is complementary information in each one of the involved modalities 
that the human brain exploits in order to understand and perceive the surrounding world. 

Inspired by that, multimedia or multimodal understanding, via cross-media integration, has given 
birth to a related field whose goal is to improve the performance in the various scientific tasks that 
arise in problems that deal with multiple modalities. An example of modality blending is to combine 
together image/video, speech/audio, and text. A related summary that also touches issues concerning
the mental processes of human sensation, perception, and cognition is presented in [9].

NATURAL LANGUAGE PROCESSING 

This is the discipline that studies the processing of a language using computers. An example of a 
natural language processing (NLP) task is that of SPAM detection. Currently, the NLP field is an area 
of intense research with typical topics being the development of automatic translation algorithms and 
software, sentiment analysis, text summarization, and authorship identification. Speech recognition
has a strong affinity with NLP and, strictly speaking, could be considered as a special subtopic of it. 
Two case studies related to NLP are treated in the book, one in Chapter 11 concerning authorship 
identification and one in Chapter 18 related to neural machine translation (NMT). 

ROBOTICS 

Robots are used to perform tasks in the manufacturing industry, e.g., in an assembly line for car production,
or by space agencies to move objects in space. More recently, the so-called social robots
are built to interact with people in their social environment. For example, social robots are used to 
benefit hospitalized children [10]. 

Robots have been used in situations that are difficult or dangerous for humans, such as bomb det- 
onation and work in difficult and hazardous environments, e.g., places of high heat, deep oceans, and 
areas of high radiation. Robots have also been developed for teaching. 

Robotics is an interdisciplinary field that, besides machine learning, includes disciplines such as 
mechanical engineering, electronic engineering, computer science, computer vision, and speech recognition.

AUTONOMOUS CARS

An autonomous or self-driving car is a vehicle that can move around with no or little human in- 
tervention. Most of us have used self-driving trains in airports. However, these operate in a very 
well-controlled environment. Autonomous cars are designed to operate in the city streets and in mo- 
torways. This field is also of interdisciplinary nature, where areas such as radar, lidar, computer vision, 
automatic control, sensor networks, and machine learning meet together. It is anticipated that the use 
of self-driving cars will reduce the number of accidents, since, statistically, most of the accidents occur 
because of human errors, due to alcohol, high speed, stress, fatigue, etc. 

There are various levels of automation that one can implement. At level 0, which is the category in 
which most of the cars currently operate, the driver has the control and the automated built-in system 
may issue warnings. The higher the level, the more autonomy is present. For example, at level 4, the 
driver would be first notified whether conditions are safe, and then the driver can decide to switch 
the vehicle into the autonomous driving mode. At the highest level, level 5, the autonomous driving 
requires absolutely no human intervention [21]. 

Besides the aforementioned examples of notable machine learning applications, machine learning 
has been applied in a wide range of other areas, such as healthcare, bioinformatics, business, finance, 
education, law, and manufacturing.

CHALLENGES FOR THE FUTURE

In spite of the impressive advances that have been achieved in machine learning, there are a number
of challenges for the foreseeable future, besides the long-term ones that were mentioned before, while
presenting AI. In the Berkeley report [18], the following challenges are summarized:

• Designing systems that learn continually by interacting with a dynamic environment, while making 
decisions that are timely, robust, and secure. 

• Designing systems that enable personalized applications and services, yet do not compromise users’
privacy and security.

• Designing systems that can train on data sets owned by different organizations without
compromising their confidentiality, and in the process provide AI capabilities that span the boundaries
of potentially competing organizations.

• Developing domain-specific architectures and software systems to address the performance needs
of future applications, including custom chips, edge-cloud systems to efficiently process data at the
edge, and techniques for abstracting and sampling data.

Besides the above more technology-oriented challenges, important social challenges do exist. The 
new technologies are influencing our daily lives more and more. In principle, they offer the potential, 
much more than ever before, to manipulate and shape beliefs, views, interests, entertainment, customs, 
and culture, independent of the societies. Moreover, they offer the potential for accessing personal data 
that in the sequel can be exploited for various reasons, such as economic, political, or other malicious 
purposes. As M. Schaake, a member of the European Parliament, puts it, “When algorithms affect hu- 
man rights, public values or public decision-making, we need oversight and transparency.” However, 
what was said before should not mobilize technophobic reactions. On the contrary, human civilization
has advanced because of leaps in science and technology. All that is needed is social sensitivity,
awareness of the possible dangers, and a related legal “shielding.”

Putting it in simple words, as Henri Bergson said [2], history is not deterministic. History is a
creative evolution.


1.5 MACHINE LEARNING: MAJOR DIRECTIONS 

As already stated, machine learning is the scientific field whose goal is to develop
methods and algorithms that “learn” from data; that is, to extract information that “resides” in the data,
which can subsequently be used by the computer to perform a task. To this end, the starting point is an
available data set. Depending on the type of information that one needs to acquire, in the context of a
specific task, different types of machine learning have been developed. They are described below.

1.5.1 SUPERVISED LEARNING 

Supervised learning refers to the type of machine learning where all the available data have been
labeled. In other words, data are represented in pairs of observations, e.g., $(y_n, x_n)$, $n = 1, 2, \ldots, N$,
where each $x_n$ is a vector or, in general, a set of variables. The variables in $x_n$ are called the input
variables, also known as the independent variables or features, and the respective vector is known as
the feature vector. The variables $y_n$ are known as the output or dependent or target or label variables.
In some cases, $y_n$ can also be a vector. The goal of learning is to obtain/estimate a functional mapping
that, given the value of the input variables, predicts the value of the respective output one. Two “pillars”
of supervised learning are the classification and the regression tasks.

Classification 

The goal in classification is to assign a pattern to one of a set of possible classes, whose number 
is considered to be known. For example, in X-ray mammography, we are given an image where a 
region indicates the existence of a tumor. The goal of a computer-aided diagnosis system is to predict 
whether this tumor corresponds to the benign or the malignant class. Optical character recognition 
(OCR) systems are also built around a classification system, in which the image corresponding to each 
letter of the alphabet has to be recognized and assigned to one of the 26 (for the Latin alphabet) classes; 
see Example 18.3 for a related case study. Another example is the prediction of the authorship of a given 
text. Given a text written by an unknown author, the goal of a classification system is to predict the 
author among a number of authors (classes); this application is treated in Section 11.15. The receiver 
in a digital communications system can also be viewed as a classification system. Upon receiving the
transmitted data, which have been contaminated by noise and also by other transformations imposed by
the transmission channel (Chapter 4), the receiver has to reach a decision on the value of the originally
transmitted symbols. For example, in a binary transmitted sequence, the original symbols belong either
to the $+1$ or to the $-1$ class. This task is known as channel equalization.

The first step in designing any machine learning task is to decide how to represent each pattern in the 
computer. This is achieved during the preprocessing stage; one has to “encode” related information that 
resides in the raw data (e.g., image pixels or strings of words) in an efficient and information-rich way. 
This is usually done by transforming the raw data into a new space and representing each pattern by a
vector, $x \in \mathbb{R}^l$. This comprises the feature vector and its $l$ elements the corresponding feature values.
In this way, each pattern becomes a single point in an $l$-dimensional space, known as the feature space
or the input space. We refer to this transformation of the raw data as the feature generation or feature
extraction stage. One starts by generating a large number, $K$, of candidate features and eventually
selects the $l$ most informative ones via an optimizing procedure known as the feature selection stage.
As we will see in Section 18.12, in the context of convolutional neural networks, the previous two 
stages are merged together and the features are obtained and optimized in a combined way, together 
with the estimation of the functional mapping, which was mentioned before. 
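
As a rough illustration of the feature generation and feature selection stages, the sketch below starts from K candidate features and keeps the l highest-scoring ones for a two-class problem. The scoring rule (a Fisher-type ratio), the synthetic data, and the names `fisher_scores` and `select_features` are assumptions made for illustration only; the text does not prescribe a particular selection criterion.

```python
import numpy as np

def fisher_scores(X, y):
    """Score each feature: (difference of class means)^2 / (sum of class variances)."""
    X_pos, X_neg = X[y == 1], X[y == -1]
    num = (X_pos.mean(axis=0) - X_neg.mean(axis=0)) ** 2
    den = X_pos.var(axis=0) + X_neg.var(axis=0) + 1e-12  # guard against division by zero
    return num / den

def select_features(X, y, l):
    """Return the sorted indices of the l highest-scoring features."""
    scores = fisher_scores(X, y)
    return np.sort(np.argsort(scores)[-l:])

# Synthetic data (assumed): K = 10 candidate features, of which only
# features 2 and 7 carry class information; the rest are pure noise.
rng = np.random.default_rng(0)
N, K = 400, 10
y = np.where(rng.random(N) < 0.5, 1, -1)
X = rng.normal(size=(N, K))
X[:, 2] += 2.0 * y
X[:, 7] -= 1.5 * y

selected = select_features(X, y, l=2)
X_reduced = X[:, selected]  # each pattern is now a point in an l-dimensional space
```

Ranking features one at a time ignores correlations among them, so this is only the simplest possible selection strategy.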

Having decided upon the input space in which the data are represented, one has to train a classifier; 
that is, a predictor. This is achieved by first selecting a set of N data points/samples/examples, whose 
class is known, and this comprises the training set. This is the set of observation pairs, $(y_n, x_n)$,
$n = 1, 2, \ldots, N$, where $y_n$ is the (output) variable denoting the class in which $x_n$ belongs, and it is known
as the corresponding class label; the class labels take values over a discrete set, e.g., $\{1, 2, \ldots, M\}$,
for an $M$-class classification task. For example, for a two-class classification task, $y_n \in \{-1, +1\}$ or
$y_n \in \{0, +1\}$. To keep our discussion simple, let us focus on the two-class case. Based on the training
data, one then designs a function, $f$, which is used to predict the output label, given the input feature
vector, $x$. In general, we may need to design a set of such functions.

Once the function, $f$, has been designed, the system is ready to make predictions. Given a pattern
whose class is unknown, we obtain the corresponding feature vector, $x$, from the raw data. Depending
on the value of $f(x)$, the pattern is classified in one of the two classes. For example, if the labels
take the values $\pm 1$, then the predicted label is obtained as $y = \mathrm{sgn}\{f(x)\}$. This operation defines the
classifier. If the function $f$ is linear (nonlinear) we say that the respective classification task is linear
(nonlinear) or, in a slight abuse of terminology, that the classifier is a linear (nonlinear) one.

FIGURE 1.1

The classifier (linear in this simple case) has been designed in order to separate the training data into two classes.
The graph (straight line) of the linear function, $f(x) = 0$, has on its positive side the points coming from one class
and on its negative side those of the other. The “red” point, whose class is unknown, is classified in the same class as
the “star” points, since it lies on the positive side of the line.

Fig. 1.1 illustrates the classification task. Initially, we are given the set of points, each one
representing a pattern in the two-dimensional space (two features used, $x_1, x_2$). Stars belong to one class,
and the crosses to the other, in a two-class classification task. These are the training points, which are
used to obtain a classifier. For our very simple case, this is achieved via a linear function,

$$f(x) = \theta_1 x_1 + \theta_2 x_2 + \theta_0, \tag{1.1}$$

whose graph, for all the points such that $f(x) = 0$, is the straight line shown in the figure. The values
of the parameters $\theta_1, \theta_2, \theta_0$ are obtained via an estimation method based on the training set. This phase,
where a classifier is estimated, is also known as the training or learning phase.

Once a classifier has been “learned,” we are ready to perform predictions, that is, to predict the 
class label of a pattern x. For example, we are given the point denoted by the red circle, whose class is 
unknown to us. According to the classification system that has been designed, this belongs to the same 
class as the points denoted by stars, which all belong to, say, class +1. Indeed, every point on one side
of the straight line will give a positive value, $f(x) > 0$, and all the points on its other side will give a
negative value, $f(x) < 0$. The predicted label, $y$, for the point denoted with the red circle will then be
$y = \mathrm{sgn}\{f(x)\} > 0$, and it is classified in class +1, to which the star points belong.
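
The training and prediction phases described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the parameters of the linear function in Eq. (1.1) are estimated here by least squares on the ±1 labels, which is only one of many possible estimation methods, and the synthetic "stars" and "crosses" as well as the function names are hypothetical.

```python
import numpy as np

def fit_linear_classifier(X, y):
    """Estimate (theta_0, theta_1, theta_2) by least squares on the +/-1 labels."""
    A = np.column_stack([np.ones(len(X)), X])  # column of ones for the bias theta_0
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def f(theta, x):
    """Linear function f(x) = theta_1*x_1 + theta_2*x_2 + theta_0."""
    return theta[0] + x @ theta[1:]

def predict(theta, x):
    """Predicted label sgn{f(x)}; a value of exactly zero is assigned to class +1."""
    return np.where(f(theta, x) >= 0, 1, -1)

# Synthetic training set (assumed): two well-separated clouds of points.
rng = np.random.default_rng(1)
stars = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))      # class +1
crosses = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # class -1
X_train = np.vstack([stars, crosses])
y_train = np.concatenate([np.ones(50), -np.ones(50)])

theta = fit_linear_classifier(X_train, y_train)  # training (learning) phase
x_new = np.array([1.5, 2.5])                     # pattern whose class is unknown
label = predict(theta, x_new)                    # prediction phase
```

Here `x_new` plays the role of the red circle in Fig. 1.1: it lies on the positive side of the estimated line and is therefore assigned to the class of the stars.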

Our discussion to present the classification task was based on features that take numeric values. 
Classification tasks where the features are of categorical type do exist and are of major importance, 
too.

FIGURE 1.2

In a regression task, once a function (linear in this case) $f$ has been designed, for its graph to fit the available
training data set, given a new (red) point, $x$, the prediction of the associated output (red) value is given by $y = f(x)$.


Regression 

Regression shares to a large extent the feature generation/selection stage, as described before; however, 
now the output variable, y, is not discrete, but it takes values in an interval in the real axis or in a region 
in the complex numbers’ plane. Generalizations to vector-valued outputs are also possible. Our focus 
here is on real variables. The regression task is basically a function (curve/surface) fitting problem. 

We are given a set of training samples, $(y_n, x_n)$, $y_n \in \mathbb{R}$, $x_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, and the task is
to estimate a function $f$, whose graph fits the data. Once we have found such a function, when a new
sample $x$, outside the training set, arrives, we can predict its output value. This is shown in Fig. 1.2.
The training data in this case are the gray points. Once the function fitting task has been completed,
given a new point $x$ (red), we are ready to predict its output value as $y = f(x)$. In the simple case of
the figure, the function $f$ is linear and thus its graph is a straight line.
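
For the one-dimensional linear case of Fig. 1.2, the fitting and prediction steps might look as follows. The least-squares criterion, the generating model for the synthetic data, and the name `fit_line` are assumptions chosen for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates (theta_0, theta_1) for f(x) = theta_1*x + theta_0."""
    A = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

# Synthetic training samples (assumed): a noisy straight line.
rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 10.0, size=200)
y_train = 0.5 * x_train + 1.0 + rng.normal(scale=0.2, size=200)

theta_0, theta_1 = fit_line(x_train, y_train)

x_new = 4.0                          # new point, outside the training set
y_pred = theta_1 * x_new + theta_0   # prediction y = f(x)
```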

The regression task is a generic one that embraces a number of problems. For example, in financial 
applications, one can predict tomorrow’s stock market prices given current market conditions and other 
related information. Each piece of information is a measured value of a corresponding feature. Signal 
and image restoration come under this common umbrella of tasks. Signal and image denoising can also 
be seen as a special type of the regression task. Deblurring of a blurred image can also be treated as 
regression (see Chapter 4). 


1.6 UNSUPERVISED AND SEMISUPERVISED LEARNING 

The goal of supervised learning is to establish a functional relationship between the input and output 
variables. To this end, labeled data are used, which comprise the set of output-input pairs, on which 
the learning of the unknown mapping is performed.

In the antipode of supervised learning lies unsupervised learning, where only input variables are
provided. No output or label information is available. The aim of unsupervised learning is to unravel 
the structure that underlies the given set of data. This is an important part in data learning methods. 
Unsupervised learning comes under a number of facets. 

One of the most important types of unsupervised learning is that of clustering. The goal of any 
clustering task is to unravel the way in which the points in a data set are grouped assuming that such a 
group structure exists. As an example, given a set of newspaper articles, one may want to group them 
together according to how similar their content is. As a matter of fact, at the heart of any clustering 
algorithm lies the concept of similarity, since patterns that belong to the same group (cluster) are 
assumed to be more similar than patterns that belong to different clusters. 

One of the most classical clustering schemes, the so-called k-means clustering, is presented and
discussed in Section 12.6.1. However, clustering is not a main topic of this book, and the interested 
reader may look at more specialized references (e.g., [14,15]). 
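
To make the notion of grouping by similarity concrete, here is a bare-bones sketch of the standard k-means iteration (alternating between assigning points to the nearest center and recomputing the centers). It uses the squared Euclidean distance as the dissimilarity measure; the initialization, the stopping rule, and the synthetic data are assumptions, and the code is not taken from Section 12.6.1.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: returns the cluster centers and the label of every point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center, shape (N, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                           # assignment step
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])                                                   # update step
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Synthetic data (assumed): two compact, well-separated groups of 100 points each.
rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[10.0, 10.0], scale=0.5, size=(100, 2)),
])
centers, labels = kmeans(X, k=2)
```

No labels are used anywhere: the group structure is recovered from the input points alone.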

Another type of unsupervised learning is dimensionality reduction. The goal is also to reveal a 
particular structure of the data, which is of a different nature than that of the groupings. For
example, although the data may be represented in a high-dimensional space, they may lie around a
lower-dimensional subspace or a manifold. Such methods are very important in machine learning for 
compressed representations or computational reduction reasons. Dimensionality reduction methods are 
treated in detail in Chapter 19. 
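
The following sketch illustrates the idea of data that live in a high-dimensional space but lie close to a lower-dimensional subspace. It uses principal component analysis computed via the singular value decomposition, which is one of several possible methods; the mixing matrix `B`, the noise level, and the function name `pca` are assumptions for illustration.

```python
import numpy as np

def pca(X, m):
    """Project the centered data onto the m leading principal directions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)   # fraction of variance along each direction
    Z = Xc @ Vt[:m].T                 # compressed m-dimensional representation
    return Z, Vt[:m], mean, explained

# Synthetic data (assumed): 5-dimensional points generated from 2 latent variables
# plus a small amount of noise, so they lie close to a 2-dimensional subspace.
rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 2))
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [0.5, 2.0]])
X = latent @ B.T + 0.01 * rng.normal(size=(500, 5))

Z, components, mean, explained = pca(X, m=2)
X_rec = Z @ components + mean         # reconstruction from the 2-D representation
```

Almost all of the variance is captured by the first two directions, so the two-dimensional representation `Z` reconstructs the original data up to the noise level.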

Probability distribution estimation can also be considered as a special case of unsupervised
learning. Probabilistic modeling is treated extensively in Chapters 12, 13, 15, and 16.

More recently, unsupervised learning is used for data generation. The so-called generative adversar- 
ial networks (GANs) comprise a new way of dealing with this old topic, by employing game theoretic 
arguments, and they are treated in Chapter 18. 

Semisupervised learning lies in between supervised and unsupervised learning. In semisupervised learning,
there are labeled data but not enough to get a good estimate of the output-input dependence. The 
existence of a number of unlabeled patterns can assist the task, since it can reveal additional structure 
of the input data that can be efficiently utilized. Semisupervised learning is treated in, e.g., [14]. 

Finally, another type of learning, which is increasingly gaining in importance, is the so-called rein- 
forcement learning (RL). This is also an old field, with origins in automatic control. At the heart of this 
type of learning lies a set of rules and the goal is to learn sequences of actions that will lead an agent 
to achieve its goal or to maximize its objective function. For example, if the agent is a robot, the goal 
may be to move from point A to point B. Intuitively, RL attempts to learn actions by trial and error.
In contrast to supervised learning, optimal actions are not learned from labels but from what is known
as a reward. This scalar value informs the system whether the outcome of whatever it did was right or 
wrong. Taking actions that maximize the reward is the goal of RL. 
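
A toy version of the "move from point A to point B" example can be sketched with tabular Q-learning, one well-known RL algorithm that the text itself does not describe. The corridor environment, the reward of 1 on reaching B, and all parameter values below are assumptions chosen to keep the example small.

```python
import numpy as np

def q_learning_corridor(n_states=6, episodes=300, alpha=0.5, gamma=0.9,
                        epsilon=0.2, max_steps=200, seed=0):
    """Learn action values on a 1-D corridor; state 0 is A, the last state is B."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))            # actions: 0 = move left, 1 = move right
    goal = n_states - 1
    for _ in range(episodes):
        s = 0                               # every episode starts at point A
        for _step in range(max_steps):
            if rng.random() < epsilon:
                a = int(rng.integers(2))    # explore: random action (trial and error)
            else:
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))   # exploit, breaking ties at random
            s_next = min(max(s + (1 if a == 1 else -1), 0), goal)
            reward = 1.0 if s_next == goal else 0.0   # scalar reward signal
            if s_next == goal:
                target = reward
            else:
                target = reward + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if s == goal:
                break
    return Q

Q = q_learning_corridor()
policy = Q[:-1].argmax(axis=1)   # greedy action in every non-goal state
```

No state is ever labeled with its correct action; the preference for moving right emerges only from the rewards collected along the way.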

Reinforcement learning is beyond the scope of this book and the interested reader may consult, e.g.,
[19].


1.7 STRUCTURE AND A ROAD MAP OF THE BOOK 

In the discussion above, we saw that seemingly different applications, e.g., authorship identification and
channel equalization, as well as financial prediction and image deblurring, can be treated in a unified
framework. Many of the techniques that have been developed for machine learning are no different from
techniques used in statistical signal processing or adaptive signal processing. Filtering comes under the
general framework of regression (Chapter 4), and “adaptive filtering” is the same as “online learning”
in machine learning. As a matter of fact, as will be explained in more detail, this book can serve the
needs of more than one advanced graduate or postgraduate course.

Over the years, a large number of techniques have been developed, in the context of different appli- 
cations. Most of these techniques belong to one of two schools of thought. In one of them, the involved 
parameters that define an unknown function, for example, $\theta_1, \theta_2, \theta_0$ in Eq. (1.1), are treated as random
variables. Bayesian learning builds upon this rationale. Bayesian methods learn distributions that
describe the randomness of the involved parameters/variables. According to the other school, parameters
are treated as nonrandom variables. They correspond to a fixed, yet unknown value. We will refer to 
such parameters as deterministic. This term is justified by the fact that, in contrast to random variables, 
if the value of a nonrandom variable is known, then its value can be “predicted” exactly. Learning 
methods that build around deterministic variables focus on optimization techniques to obtain estimates 
of the corresponding values. In some cases, the term “frequentist” is used to describe the latter type of 
techniques (see Chapter 12). 
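
The difference between the two schools can be illustrated on the simplest possible problem: estimating the mean θ of Gaussian data with known variance. In the sketch below, the deterministic view returns a single number (the maximum likelihood estimate, i.e., the sample mean), whereas the Bayesian view, under an assumed Gaussian prior on θ, returns a whole posterior distribution described by its mean and variance. The prior parameters and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ml_estimate(x):
    """Deterministic view: theta is fixed but unknown; estimate it by the sample mean."""
    return x.mean()

def gaussian_posterior(x, sigma2, mu0, sigma0_2):
    """Bayesian view: prior theta ~ N(mu0, sigma0_2), data x_n ~ N(theta, sigma2).

    Returns the mean and variance of the Gaussian posterior of theta.
    """
    N = len(x)
    post_var = 1.0 / (1.0 / sigma0_2 + N / sigma2)
    post_mean = post_var * (mu0 / sigma0_2 + x.sum() / sigma2)
    return post_mean, post_var

# Synthetic observations (assumed): 20 samples with true mean 2 and unit variance.
rng = np.random.default_rng(4)
theta_true, sigma2 = 2.0, 1.0
x = rng.normal(theta_true, np.sqrt(sigma2), size=20)

theta_ml = ml_estimate(x)
post_mean, post_var = gaussian_posterior(x, sigma2, mu0=0.0, sigma0_2=1.0)
```

With a zero-mean prior, the posterior mean is a shrunken version of the sample mean, and the posterior variance quantifies the remaining uncertainty about θ; both effects fade as more data become available.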

Each of the two previous schools of thought has its pros and cons, and I firmly believe that there 
is always more than one road that leads to the “truth.” Each can solve some problems more efficiently 
than the other. Maybe in a few years, the scene will be more clear and more definite conclusions can
be drawn. Or it may turn out, as in life, that the “truth” is somewhere in the middle. 

In any case, every newcomer to the field has to learn the basics and the classics. That is why, in this
book, all major directions and methods will be discussed, in an equally balanced manner, to the greatest
extent possible. Of course, the author, being human, could not avoid giving some more emphasis to 
the techniques with which he is most familiar through his own research. This is healthy, since writing 
a book is a means of sharing the author’s expertise and point of view with the readers. This is why I 
strongly believe that a new book does not serve to replace previous ones, but to complement previously 
published points of view. 

Chapter 2 is an introduction to probability and statistics. Random processes are also discussed.
Readers who are familiar with such concepts can bypass this chapter. On the other hand, one can 
focus on different parts of this chapter. Readers who would like to focus on statistical signal process- 
ing/adaptive processing can focus more on the random processes part. Those who would like to follow 
a probabilistic machine learning point of view would find the part presenting the various distributions 
more important. In any case, the multivariate normal (Gaussian) distribution is a must for those who 
are not yet familiar with it. 

Chapter 3 is an overview of the parameter estimation task. This is a chapter that presents an 
overview of the book and defines the main concepts that run across its pages. This chapter has also 
been written to stand alone as an introduction to machine learning. Although it is my feeling that all
of it should be read and taught, depending on the focus of the course and taking into account the om- 
nipresent time limitations, one can focus more on the parts of her or his interest. Least-squares and 
ridge regression are discussed alongside the maximum likelihood method and the presentation of the 
basic notion of the Bayesian approach. In any case, the parts dealing with the definition of the inverse 
problems, the bias-variance tradeoff, and the concepts of generalization and regularization are a must. 

Chapter 4 is dedicated to the mean-square error (MSE) linear estimation. For those following a 
statistical signal processing course, the whole chapter is important. The rest of the readers can bypass
the parts related to complex-valued processing and also the part dealing with computational complexity
issues, since this is only of importance if the input data are random processes. Bypassing this part will
not affect reading later parts of the chapter that deal with the MSE estimation of linear models, the 
Gauss-Markov theorem, and the Kalman filtering. 

Chapter 5 introduces the stochastic gradient descent family of algorithms. The first part, dealing 
with the stochastic approximation method, is a must for every reader. The rest of the chapter, which 
deals with the least-mean-squares (LMS) algorithm and its offspring, is more appropriate for readers
who are interested in a statistical signal processing course, since these families are suited for track- 
ing time-varying environments. This may not be the first priority for readers who are interested in 
classification and machine learning tasks with data whose statistical properties are not time-varying. 

Chapter 6 is dedicated to the least-squares (LS) method, which is of interest to all readers in
machine learning and signal processing. The latter part, dealing with the total least-squares method, can
be bypassed in a first reading. Emphasis is also put on ridge regression and its geometric interpretation. 
Ridge regression is important to the newcomer, since he/she becomes familiar with the concept of regu- 
larization; this is an important aspect in any machine learning task, tied directly with the generalization 
performance of the designed predictor. 
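
As a small illustration of what regularization does, the sketch below compares the ordinary least-squares estimate with the ridge regression estimate in closed form, $(X^T X + \lambda I)^{-1} X^T y$. The synthetic data, the value of λ, and the function name `ridge` are assumptions; the geometric interpretation discussed in Chapter 6 is not reproduced here.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam*I) theta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic regression data (assumed): 30 samples, 5 features, small additive noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))
theta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=30)

theta_ls = ridge(X, y, lam=0.0)      # lam = 0 recovers ordinary least squares
theta_ridge = ridge(X, y, lam=10.0)  # lam > 0 shrinks the estimate toward zero
```

The penalty trades a small bias for a smaller-norm, less variable estimate, which is the mechanism through which regularization affects generalization performance.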

I have decided to compress the part dealing with fast LS algorithms, which are appropriate when the
input is a random process/signal that imposes a special structure on the involved covariance matrices, 
into a discussion section. It is the author’s feeling that this is of no greater interest than it was a decade 
or two ago. Also, the main idea, that of a highly structured covariance matrix that lies behind the fast 
algorithms, is discussed in some detail in Chapter 4, in the context of Levinson’s algorithm and its 
lattice and lattice-ladder by-products. 

Chapter 7 is a must for any machine learning course. Important classical concepts, including
classification in the context of the Bayesian decision theory, nearest neighbor classifiers, logistic regression,
Fisher's discriminant analysis and decision trees are discussed. Courses on statistical signal processing
can also accommodate the first part of the chapter dealing with the classical Bayesian decision theory. 

The aforementioned six chapters comprise the part of the book that deals with more or less classical 
topics. The rest of the chapters deal with more advanced techniques and can fit with any course
dealing with machine learning or statistical/adaptive signal processing, depending on the focus, the time
constraints, and the background of the audience. 

Chapter 8 deals with convexity, a topic that is receiving more and more attention recently. The 
chapter presents the basic definitions concerning convex sets and functions and the notion of projec- 
tion. These are important tools used in a number of recently developed algorithms. Also, the classical 
projections onto convex sets (POCS) algorithm and the set-theoretic approach to online learning are
discussed as an alternative to gradient descent-based schemes. Then, the task of optimization of
nonsmooth convex loss functions is introduced, and the family of proximal mapping, alternating direction
method of multipliers (ADMM), and forward-backward splitting methods are presented. This is a
chapter that can be used when the emphasis of the course is on optimization. Employing nonsmooth loss
functions and/or nonsmooth regularization terms, in place of the squared error and its ridge regression 
relative, is a trend of high research and practical interest. 

Chapters 9 and 10 deal with sparse modeling. The first of the two chapters introduces the main 
concepts and ideas and the second deals with algorithms for batch as well as for online learning
scenarios. Also, in the second chapter, a case study in the context of time-frequency analysis is discussed.
Depending on time constraints, the main concepts behind sparse modeling and compressed sensing can
be taught in a related course. These two chapters can also be used as a specialized course on sparsity
on a postgraduate level.

Chapter 11 deals with learning in reproducing kernel Hilbert spaces and nonlinear techniques. The
first part of the chapter is a must for any course with an emphasis on classification. Support vector
regression and support vector machines are treated in detail. Moreover, a course on statistical signal
processing with an emphasis on nonlinear modeling can also include material and concepts from this
chapter. A case study dealing with authorship identification is discussed at the end of this chapter.

Chapters 12 and 13 deal with Bayesian learning. Thus, both chapters can be the backbone of 
a course on machine learning and statistical signal processing that intends to emphasize Bayesian 
methods. The former of the chapters deals with the basic principles and it is an introduction to the
expectation-maximization (EM) algorithm. The use of this celebrated algorithm is demonstrated in 
the context of two classical applications: linear regression and Gaussian mixture modeling for prob- 
ability density function estimation. The second chapter deals with approximate inference techniques, 
and one can use parts of it, depending on the time constraints and the background of the audience. 
Sparse Bayesian learning and the relevance vector machine (RVM) framework are introduced. At the 
end of this chapter, nonparametric Bayesian techniques such as the Chinese restaurant process (CRP), 
the Indian buffet process (IBP), and Gaussian processes are discussed. Finally, a case study concern- 
ing hyperspectral image unmixing is presented. Both chapters, in their full length, can be used as a 
specialized course on Bayesian learning. 

Chapters 14 and 17 deal with Monte Carlo sampling methods. The latter chapter deals with particle 
filtering. Both chapters, together with the two previous ones that deal with Bayesian learning, can be 
combined in a course whose emphasis is on statistical methods of machine learning/statistical signal 
processing. 

Chapters 15 and 16 deal with probabilistic graphical models. The former chapter introduces the 
main concepts and definitions, and at the end it introduces the message passing algorithm for chains 
and trees. This chapter is a must for any course whose emphasis is on probabilistic graphical models. 
The latter of the two chapters deals with message passing algorithms on junction trees and then with 
approximate inference techniques. Dynamic graphical models and hidden Markov models (HMMs) are 
introduced. The Baum-Welch and Viterbi schemes are derived as special cases of message passing
algorithms by treating the HMM as a special instance of a junction tree.

Chapter 18 deals with neural networks and deep learning. In the second edition, this chapter has 
been basically rewritten to accommodate advances in this topic that have taken place after the first 
edition was published. This chapter is also a must in any course with an emphasis on classification. 
The chapter starts from the early days of the perceptron algorithm and perceptron rule and moves on 
to the most recent advances in deep learning. The feed-forward multilayer architecture is introduced 
and a number of stochastic gradient-type algorithmic variants, for training networks, are presented. 
Different types of nonlinearities and cost functions are discussed and their interplay with respect to the
vanishing/exploding phenomenon in the gradient propagation is presented. Regularization and the
dropout method are discussed. Convolutional neural networks (CNNs) and recurrent neural networks 
(RNNs) are reviewed in some detail. The notions of attention mechanism and adversarial examples 
are presented. Deep belief networks, GANs, and variational autoencoders are considered in some
detail. Capsule networks are introduced and, at the end, a discussion on transfer learning and multitask
learning is provided. The chapter concludes with a case study related to neural machine translation 
(NMT).

Chapter 19 is on dimensionality reduction techniques and latent variable modeling. The
methods of principal component analysis (PCA), canonical correlations analysis (CCA), and independent
component analysis (ICA) are introduced. The probabilistic approach to latent variable modeling is 
discussed, and the probabilistic PCA (PPCA) is presented. Then, the focus turns to dictionary learning 
and to robust PCA. Nonlinear dimensionality reduction techniques such as kernel PCA are discussed, 
along with classical manifold learning methods: local linear embedding (LLE) and isometric mapping 
(ISOMAP). Finally, a case study in the context of functional magnetic resonance imaging (fMRI) data 
analysis, based on ICA, is presented. 

Each chapter starts with the basics and moves on to cover more recent advances in the corresponding
topic. This is also true for the whole book and the first six chapters cover more classical material.

In summary, we provide the following suggestions for different courses, depending on the emphasis
that the instructor wants to place on various topics.

• Machine learning with emphasis on classification: 

- Main chapters: 3, 7, 11, and 18. 

- Secondary chapters: 12 and 13, and possibly the first part of 6. 

• Statistical signal processing: 

- Main chapters: 3, 4, 6, and 12. 

- Secondary chapters: 5 (first part) and 13-17. 

• Machine learning with emphasis on Bayesian techniques: 

- Main chapters: 3 and 12-14. 

- Secondary chapters: 7, 15, and 16, and possibly the first part of 6. 

• Adaptive signal processing: 

- Main chapters: 3-6. 

- Secondary chapters: 8, 9, 10, 11, 14, and 17. 

I believe that following the various combinations of chapters suggested above is possible, since
the book has been written in such a way as to make individual chapters as self-contained as possible.

At the end of most of the chapters, there are computer exercises, mainly based on the various
examples given in the text. The exercises are given in MATLAB® and the respective code is available
on the book’s website. Moreover, all exercises are provided, together with respective codes, in Python
and are also available on the book’s website. Some of the exercises in Chapter 18 are in the context of
TensorFlow.

The solutions manual as well as all the figures of the book are available on the book’s website.


REFERENCES 

[1] R. Bajcsy, Active perception, Proceedings of the IEEE 76 (8) (1988) 966-1005. 

[2] H. Bergson, Creative Evolution, McMillan, London, 1922. 

[3] A. Damasio, Descartes’ Error: Emotion, Reason, and the Human Brain, Penguin, 2005 (paperback reprint). 

[4] S. Dehaene, H. Lau, S. Kouider, What is consciousness, and could machines have it?, Science 358 (6362) (2017) 486-492. 

[5] D. Deutsch, The Fabric of Reality, Penguin, 1998. 

[6] D. Deutsch, The Beginning of Infinity: Explanations That Transform the World, Allen Lane, 2011. 

[7] H. Dreyfus, What Computers Still Can’t Do, M.I.T. Press, 1992. 

[8] M. Jordan, Artificial intelligence: the revolution hasn’t happened yet, https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7, 2018. 






[9] P. Maragos, P. Gros, A. Katsamanis, G. Papandreou, Cross-modal integration for performance improving in multimedia: a 
review, in: P. Maragos, A. Potamianos, P. Gros (Eds.), Multimodal Processing and Interaction: Audio, Video, Text, Springer, 
2008. 

[10] MIT News, Study: social robots can benefit hospitalized children, http://news.mit.edu/2019/social-robots-benefit-sick-children-0626. 

[11] J. Pearl, D. Mackenzie, The Book of Why: The New Science of Cause and Effect, Basic Books, 2018. 

[12] K. Popper, The Logic of Scientific Discovery, Routledge Classics, 2002. 

[13] S. Russell, P. Norvig, Artificial Intelligence, third ed., Prentice Hall, 2010. 

[14] S. Theodoridis, K. Koutroumbas, Pattern Recognition, fourth ed., Academic Press, Amsterdam, 2009. 

[15] S. Theodoridis, A. Pikrakis, K. Koutroumbas, D. Cavouras, Introduction to Pattern Recognition: A MATLAB Approach, 
Academic Press, Amsterdam, 2010. 

[16] A. Turing, Computing machinery intelligence, MIND 49 (1950) 433-460. 

[17] N. Schwarz, I. Skurnik, Feeling and thinking: implications for problem solving, in: J.E. Davidson, R. Sternberg (Eds.), The 
Psychology of Problem Solving, Cambridge University Press, 2003, pp. 263-292. 

[18] I. Stoica, et al., A Berkeley View of Systems Challenges for AI, Technical Report No. UCB/EECS-2017-159, 2017. 

[19] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018. 

[20] The history of artificial intelligence, University of Washington, https://courses.cs.washington.edu/courses/csep590/06au/projects/history-ai.pdf. 

[21] USA Department of Transportation Report, https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/13069a_ads2.0-0906179a_tag.pdf. 

[22] The Changing Nature of Work, World Bank Report, 2019, http://documents.worldbank.org/curated/en/816281518818814423/pdf/2019-WDR-Report.pdf. 

[23] World Economic Forum, The future of jobs, http://www3.weforum.org/docs/WEF_Future_of_Jobs_2018.pdf, 2018. 



CHAPTER 2 

PROBABILITY AND STOCHASTIC PROCESSES 


CONTENTS 

2.1 Introduction 
2.2 Probability and Random Variables 
    2.2.1 Probability 
        Relative Frequency Definition 
        Axiomatic Definition 
    2.2.2 Discrete Random Variables 
        Joint and Conditional Probabilities 
        Bayes Theorem 
    2.2.3 Continuous Random Variables 
    2.2.4 Mean and Variance 
        Complex Random Variables 
    2.2.5 Transformation of Random Variables 
2.3 Examples of Distributions 
    2.3.1 Discrete Variables 
        The Bernoulli Distribution 
        The Binomial Distribution 
        The Multinomial Distribution 
    2.3.2 Continuous Variables 
        The Uniform Distribution 
        The Gaussian Distribution 
        The Central Limit Theorem 
        The Exponential Distribution 
        The Beta Distribution 
        The Gamma Distribution 
        The Dirichlet Distribution 
2.4 Stochastic Processes 
    2.4.1 First- and Second-Order Statistics 
    2.4.2 Stationarity and Ergodicity 
    2.4.3 Power Spectral Density 
        Properties of the Autocorrelation Sequence 
        Power Spectral Density 
        Transmission Through a Linear System 
        Physical Interpretation of the PSD 
    2.4.4 Autoregressive Models 
2.5 Information Theory 
    2.5.1 Discrete Random Variables 
        Information 


Machine Learning. https://doi.org/10.1016/B978-0-12-818803-3.00011-8 
Copyright © 2020 Elsevier Ltd. All rights reserved. 


        Mutual and Conditional Information 
        Entropy and Average Mutual Information 
    2.5.2 Continuous Random Variables 
        Average Mutual Information and Conditional Information 
        Relative Entropy or Kullback-Leibler Divergence 
2.6 Stochastic Convergence 
    Convergence Everywhere 
    Convergence Almost Everywhere 
    Convergence in the Mean-Square Sense 
    Convergence in Probability 
    Convergence in Distribution 
Problems 
References 


2.1 INTRODUCTION 

The goal of this chapter is to provide the basic definitions and properties related to probability theory 
and stochastic processes. It is assumed that the reader has attended a basic course on probability and 
statistics prior to reading this book. So, the aim is to help the reader refresh her/his memory and to 
establish a common language and a commonly understood notation. 

Besides probability and random variables, random processes are briefly reviewed and some basic 
theorems are stated. A number of key probability distributions that will be used later on in a number 
of chapters are presented. Finally, at the end of the chapter, basic definitions and properties related to 
information theory and stochastic convergence are summarized. 

The reader who is familiar with all these notions can bypass this chapter. 


2.2 PROBABILITY AND RANDOM VARIABLES 

A random variable, $\mathrm{x}$, is a variable whose variations are due to chance/randomness. A random variable 
can be considered as a function, which assigns a value to the outcome of an experiment. For example, 
in a coin tossing experiment, the corresponding random variable, $\mathrm{x}$, can assume the values $x_1 = 0$ if 
the result of the experiment is “heads” and $x_2 = 1$ if the result is “tails.” 

We will denote a random variable with a lower case roman, such as $\mathrm{x}$, and the values it takes once 
an experiment has been performed, with mathmode italics, such as $x$. 

A random variable is described in terms of a set of probabilities if its values are of a discrete nature, 
or in terms of a probability density function (PDF) if its values lie anywhere within an interval of the 
real axis (noncountably infinite set). For a more formal treatment and discussion, see [4,6]. 

2.2.1 PROBABILITY 

Although the words “probability” and “probable” are quite common in our everyday vocabulary, the 
mathematical definition of probability is not a straightforward one, and there are a number of different 
definitions that have been proposed over the years. Needless to say, whatever definition is adopted, 
the end result is that the properties and rules which are derived remain the same. Two of the most 
commonly used definitions are the following. 

Relative Frequency Definition 

The probability $P(A)$ of an event $A$ is the limit 

$$P(A) = \lim_{n \to \infty} \frac{n_A}{n}, \tag{2.1}$$

where $n$ is the number of total trials and $n_A$ the number of times event $A$ occurred. The problem with 
this definition is that in practice in any physical experiment, the numbers $n_A$ and $n$ can be large, yet 
they are always finite. Thus, the limit can only be used as a hypothesis and not as something that can 
be attained experimentally. In practice, often, we use 

$$P(A) \approx \frac{n_A}{n} \tag{2.2}$$

for large values of $n$. However, this has to be used with caution, especially when the probability of an 
event is very small. 
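
The approximation in Eq. (2.2) can be explored with a small simulation. The following is only a sketch under stated assumptions: the function name `relative_frequency`, the seed, and the probability values are illustrative choices that do not come from the text, and Python's pseudorandom generator stands in for a physical experiment.

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Return n_A / n for n_trials simulated trials of an event A.

    p_true is the (assumed known) probability of A in each trial; the
    seed is fixed so that repeated runs give the same estimate.
    """
    rng = random.Random(seed)
    # Count how many trials produced the event A.
    n_a = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_a / n_trials
```

Calling `relative_frequency(0.3, n)` for growing `n` should give values that settle near 0.3. For a rare event, for instance `relative_frequency(0.001, 100)`, the count $n_A$ is only a handful of trials at best, which illustrates why Eq. (2.2) has to be used with caution for small probabilities.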

Axiomatic Definition 

This definition of probability is traced back to 1933 to the work of Andrey Kolmogorov, who found 
a close connection between probability theory and the mathematical theory of sets and functions of a 
real variable, in the context of measure theory, as noted in [5]. 

The probability $P(A)$ of an event is a nonnegative number assigned to this event, or 

$$P(A) \ge 0. \tag{2.3}$$

The probability of an event $C$ which is certain to occur is equal to one, i.e., 

$$P(C) = 1. \tag{2.4}$$

If two events $A$ and $B$ are mutually exclusive (they cannot occur simultaneously), then the probability 
of occurrence of either $A$ or $B$ (denoted as $A \cup B$) is given by 

$$P(A \cup B) = P(A) + P(B). \tag{2.5}$$

It turns out that these three defining properties, which can be considered as the respective axioms, suffice 
to develop the rest of the theory. For example, it can be shown that the probability of an impossible 
event is equal to zero; see, e.g., [6]. 

The previous two approaches for defining probability are not the only ones. Another interpretation, 
which is in line with the way we are going to use the notion of probability in a number of places in this 
book in the context of Bayesian learning, has been given by Cox [2]. There, probability was seen as 
a measure of uncertainty concerning an event. Take, for example, the uncertainty whether the Minoan 
civilization was destroyed as a consequence of the earthquake that happened close to the island of 
Santorini. This is obviously not an event whose probability can be tested with repeated trials. However, 
putting together historical as well as scientific evidence, we can quantify our expression of uncertainty 
concerning such a conjecture. Also, we can modify the degree of our uncertainty once more historical 
evidence comes to light due to new archeological findings. Assigning numerical values to represent 
degrees of belief, Cox developed a set of axioms encoding common sense properties of such beliefs, 
and he came to a set of rules equivalent to the ones we are going to review soon; see also [4]. 

The origins of probability theory are traced back to the mid-17th century in the works of Pierre 
Fermat (1601-1665), Blaise Pascal (1623-1662), and Christian Huygens (1629-1695). The concepts 
of probability and the mean value of a random variable can be found there. The original motivation for 
developing the theory seems not to be related to any purpose for “serving society”; the purpose was to 
serve the needs of gambling and games of chance! 


2.2.2 DISCRETE RANDOM VARIABLES 

A discrete random variable $\mathrm{x}$ can take any value from a finite or countably infinite set $\mathcal{X}$. The probability 
of the event “$\mathrm{x} = x \in \mathcal{X}$” is denoted as 

$$P(\mathrm{x} = x) \quad \text{or simply} \quad P(x). \tag{2.6}$$

The function $P$ is known as the probability mass function (PMF). Being a probability of events, it has 
to satisfy the first axiom, so $P(x) \ge 0$. Assuming that no two values in $\mathcal{X}$ can occur simultaneously 
and that after any experiment a single value will always occur, the second and third axioms combined 
give 

$$\sum_{x \in \mathcal{X}} P(x) = 1. \tag{2.7}$$

The set $\mathcal{X}$ is also known as the sample or state space. 

Joint and Conditional Probabilities 

The joint probability of two events, $A$, $B$, is the probability that both events occur simultaneously, 
and it is denoted as $P(A, B)$. Let us now consider two random variables, $\mathrm{x}$, $\mathrm{y}$, with sample spaces 
$\mathcal{X} = \{x_1, \ldots, x_{n_x}\}$ and $\mathcal{Y} = \{y_1, \ldots, y_{n_y}\}$, respectively. Let us adopt the relative frequency definition 
and assume that we carry out $n$ experiments and that each one of the values in $\mathcal{X}$ occurred $n_1^x, \ldots, n_{n_x}^x$ 
times and each one of the values in $\mathcal{Y}$ occurred $n_1^y, \ldots, n_{n_y}^y$ times. Then, 

$$P(x_i) \approx \frac{n_i^x}{n}, \ i = 1, 2, \ldots, n_x, \quad \text{and} \quad P(y_j) \approx \frac{n_j^y}{n}, \ j = 1, 2, \ldots, n_y.$$

Let us denote by $n_{ij}$ the number of times the values $x_i$ and $y_j$ occurred simultaneously. Then, 
$P(x_i, y_j) \approx \frac{n_{ij}}{n}$. Simple reasoning dictates that the total number, $n_i^x$, of times that value $x_i$ occurred is equal to 

$$n_i^x = \sum_{j=1}^{n_y} n_{ij}. \tag{2.8}$$

Dividing both sides in the above by $n$, the following sum rule readily results: 

$$P(x) = \sum_{y \in \mathcal{Y}} P(x, y): \quad \text{sum rule}. \tag{2.9}$$


The conditional probability of an event $A$, given another event $B$, is denoted as $P(A|B)$, and it is 
defined as 

$$P(A|B) := \frac{P(A, B)}{P(B)}: \quad \text{conditional probability}, \tag{2.10}$$

provided $P(B) \ne 0$. It can be shown that this is indeed a probability, in the sense that it respects all three 
axioms [6]. We can better grasp its physical meaning if the relative frequency definition is adopted. Let 
$n_{AB}$ be the number of times that both events occurred simultaneously, and let $n_B$ be the number of 
times event $B$ occurred, out of $n$ experiments. Then we have 

$$P(A|B) = \frac{n_{AB}}{n} \frac{n}{n_B} = \frac{n_{AB}}{n_B}. \tag{2.11}$$

In other words, the conditional probability of an event $A$, given another event $B$, is the relative frequency 
that $A$ occurred, not with respect to the total number of experiments performed, but relative to 
the times event $B$ occurred. 

Viewed differently and adopting similar notation in terms of random variables, in conformity with 
Eq. (2.9), the definition of the conditional probability is also known as the product rule of probability, 
written as 

$$P(x, y) = P(x|y) P(y): \quad \text{product rule}. \tag{2.12}$$

To differentiate from the joint and conditional probabilities, probabilities $P(x)$ and $P(y)$ are known as 
marginal probabilities. The product rule is generalized in a straightforward way to $l$ random variables, 
i.e., 

$$P(x_1, x_2, \ldots, x_l) = P(x_l | x_{l-1}, \ldots, x_1) P(x_{l-1}, \ldots, x_1),$$

which recursively leads to the product 

$$P(x_1, x_2, \ldots, x_l) = P(x_l | x_{l-1}, \ldots, x_1) P(x_{l-1} | x_{l-2}, \ldots, x_1) \cdots P(x_1).$$

Statistical independence: Two random variables are said to be statistically independent if and only 
if their joint probability is equal to the product of the respective marginal probabilities, i.e., 

$$P(x, y) = P(x) P(y). \tag{2.13}$$
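
The sum rule, the product rule, and the independence condition of Eq. (2.13) can be checked on a small joint probability table. The sketch below is illustrative only: the table values and the helper names (`marginals`, `conditional_x_given_y`, `is_independent`) are assumptions made for this example and are not taken from the text.

```python
def marginals(joint):
    """Sum rule: P(x_i) = sum_j P(x_i, y_j) and P(y_j) = sum_i P(x_i, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def conditional_x_given_y(joint, j):
    """P(x | y = y_j) = P(x, y_j) / P(y_j), assuming P(y_j) is nonzero."""
    _, py = marginals(joint)
    return [row[j] / py[j] for row in joint]


def is_independent(joint, tol=1e-12):
    """Check whether P(x_i, y_j) = P(x_i) P(y_j) holds for every pair (i, j)."""
    px, py = marginals(joint)
    return all(abs(joint[i][j] - px[i] * py[j]) <= tol
               for i in range(len(px)) for j in range(len(py)))


# Hypothetical joint PMF; rows index the values of x, columns the values of y.
joint = [[0.10, 0.20],
         [0.30, 0.40]]
```

For this table the marginals are approximately [0.3, 0.7] for x and [0.4, 0.6] for y, and `is_independent(joint)` returns `False` because, for example, 0.10 differs from 0.3 times 0.4. A table built as the outer product of two marginals passes the check.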


Bayes Theorem 

The Bayes theorem is a direct consequence of the product rule and the symmetry property of the joint 
probability, $P(x, y) = P(y, x)$, and it is stated as 

$$P(y|x) = \frac{P(x|y) P(y)}{P(x)}: \quad \text{Bayes theorem}, \tag{2.14}$$

where the marginal, $P(x)$, can be written as 

$$P(x) = \sum_{y \in \mathcal{Y}} P(x, y) = \sum_{y \in \mathcal{Y}} P(x|y) P(y),$$

and it can be considered as the normalizing constant of the numerator on the right-hand side in 
Eq. (2.14), which guarantees that summing up $P(y|x)$ with respect to all possible values of $y \in \mathcal{Y}$ 
results in one. 

The Bayes theorem plays a central role in machine learning, and it will be the basis for developing 
Bayesian techniques for estimating the values of unknown parameters. 
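
As a numerical illustration of Eq. (2.14), the following sketch computes $P(y|x)$ for a single observed value of x from a prior $P(y)$ and the likelihoods $P(x|y)$. The function name `posterior` and the numbers in the usage note are hypothetical and serve only to show the role of the marginal $P(x)$ as normalizing constant.

```python
def posterior(prior, likelihood):
    """Bayes theorem for a discrete variable y and one observed value x.

    prior[j]      : P(y_j)
    likelihood[j] : P(x | y_j) for the observed x
    Returns the list of P(y_j | x).
    """
    # Marginal P(x) = sum_j P(x | y_j) P(y_j), the normalizing constant.
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    if evidence == 0.0:
        raise ValueError("the observed value has zero probability under the model")
    return [l * p / evidence for l, p in zip(likelihood, prior)]
```

With the assumed values `prior = [0.01, 0.99]` and `likelihood = [0.95, 0.05]`, the marginal is 0.0095 + 0.0495 = 0.059 and the first posterior entry equals 0.0095/0.059, roughly 0.16. The returned values always sum to one, which is the normalization property noted above.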

2.2.3 CONTINUOUS RANDOM VARIABLES 

So far, we have focused on discrete random variables. Our interest now turns to the extension of the 
notion of probability to random variables which take values on the real axis, R. 

The starting point is to compute the probability of a random variable, $\mathrm{x}$, to lie in an interval, 
$x_1 < \mathrm{x} \le x_2$. Note that the two events, $\mathrm{x} \le x_1$ and $x_1 < \mathrm{x} \le x_2$, are mutually exclusive. Thus, we can 
write that 

$$P(\mathrm{x} \le x_1) + P(x_1 < \mathrm{x} \le x_2) = P(\mathrm{x} \le x_2). \tag{2.15}$$

We define the cumulative distribution function (CDF) of $\mathrm{x}$ as 

$$F_{\mathrm{x}}(x) := P(\mathrm{x} \le x): \quad \text{cumulative distribution function}. \tag{2.16}$$

Then, Eq. (2.15) can be written as 

$$P(x_1 < \mathrm{x} \le x_2) = F_{\mathrm{x}}(x_2) - F_{\mathrm{x}}(x_1). \tag{2.17}$$

Note that $F_{\mathrm{x}}$ is a monotonically nondecreasing function. Furthermore, if it is continuous, the random variable 
$\mathrm{x}$ is said to be of a continuous type. Assuming that it is also differentiable, we can define the 
probability density function (PDF) of $\mathrm{x}$ as 

$$p_{\mathrm{x}}(x) := \frac{dF_{\mathrm{x}}(x)}{dx}: \quad \text{probability density function}, \tag{2.18}$$

which then leads to 

$$P(x_1 < \mathrm{x} \le x_2) = \int_{x_1}^{x_2} p_{\mathrm{x}}(x)\, dx. \tag{2.19}$$

Also, 

$$F_{\mathrm{x}}(x) = \int_{-\infty}^{x} p_{\mathrm{x}}(z)\, dz. \tag{2.20}$$

Using familiar arguments from calculus, the PDF can be interpreted as 

$$\Delta P(x < \mathrm{x} \le x + \Delta x) \approx p_{\mathrm{x}}(x) \Delta x, \tag{2.21}$$

which justifies its name as a “density” function, being the probability ($\Delta P$) of $\mathrm{x}$ lying in a small 
interval $\Delta x$, divided by the length of this interval. Note that as $\Delta x \to 0$ this probability tends to zero. 
Thus, the probability of a continuous random variable taking any single value is zero. Moreover, since 
$P(-\infty < \mathrm{x} < +\infty) = 1$, we have 

$$\int_{-\infty}^{+\infty} p_{\mathrm{x}}(x)\, dx = 1. \tag{2.22}$$

Usually, in order to simplify notation, the subscript $\mathrm{x}$ is dropped and we write $p(x)$, unless it is 
necessary for avoiding possible confusion. Note, also, that we have adopted the lower case “p” to 
denote a PDF and the capital “P” to denote a probability. 

All previously stated rules concerning probabilities carry over readily to the case of PDFs, in 
the following way: 

$$p(x|y) = \frac{p(x, y)}{p(y)}, \quad p(x) = \int_{-\infty}^{+\infty} p(x, y)\, dy. \tag{2.23}$$

2.2.4 MEAN AND VARIANCE 

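Two of the most common and useful quantities associated with any random variable are the respective 
mean value and variance. The mean value (sometimes called the expected value) is denoted as 

$$\mathbb{E}[\mathrm{x}] := \int_{-\infty}^{+\infty} x\, p(x)\, dx: \quad \text{mean value}, \tag{2.24}$$

where for discrete random variables the integration is replaced by summation ($\mathbb{E}[\mathrm{x}] = \sum_{x \in \mathcal{X}} x P(x)$). 
The variance is denoted as $\sigma_x^2$ and it is defined as 

$$\sigma_x^2 := \int_{-\infty}^{+\infty} (x - \mathbb{E}[\mathrm{x}])^2 p(x)\, dx: \quad \text{variance}, \tag{2.25}$$

where integration is replaced by summation for discrete variables. The variance is a measure of the 
spread of the values of the random variable around its mean value. 

The definition of the mean value is generalized for any function $f(\mathrm{x})$, i.e., 

$$\mathbb{E}[f(\mathrm{x})] := \int_{-\infty}^{+\infty} f(x) p(x)\, dx. \tag{2.26}$$

It is readily shown that the mean value with respect to two random variables, $\mathrm{y}$, $\mathrm{x}$, can be written as 
the nested expectation 

$$\mathbb{E}_{\mathrm{x},\mathrm{y}}[f(\mathrm{x}, \mathrm{y})] = \mathbb{E}_{\mathrm{x}}\left[\mathbb{E}_{\mathrm{y}|\mathrm{x}}[f(\mathrm{x}, \mathrm{y})]\right], \tag{2.27}$$

where $\mathbb{E}_{\mathrm{y}|\mathrm{x}}$ denotes the mean value with respect to $p(y|x)$. This is a direct consequence of the definition 
of the mean value and the product rule of probabilities. 

Given two random variables $\mathrm{x}$, $\mathrm{y}$, their covariance is defined as 

$$\mathrm{cov}(\mathrm{x}, \mathrm{y}) := \mathbb{E}\left[(\mathrm{x} - \mathbb{E}[\mathrm{x}])(\mathrm{y} - \mathbb{E}[\mathrm{y}])\right], \tag{2.28}$$

and their correlation as 

$$r_{xy} := \mathbb{E}[\mathrm{x}\mathrm{y}] = \mathrm{cov}(\mathrm{x}, \mathrm{y}) + \mathbb{E}[\mathrm{x}]\, \mathbb{E}[\mathrm{y}]. \tag{2.29}$$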

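A random vector is a collection of random variables, $\mathbf{x} = [\mathrm{x}_1, \ldots, \mathrm{x}_l]^T$, and $p(\mathbf{x})$ is the joint PDF 
(probability mass for discrete variables), 

$$p(\mathbf{x}) = p(x_1, \ldots, x_l). \tag{2.30}$$

The covariance matrix of a random vector $\mathbf{x}$ is defined as 

$$\mathrm{Cov}(\mathbf{x}) := \mathbb{E}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^T\right]: \quad \text{covariance matrix}, \tag{2.31}$$

or 

$$\mathrm{Cov}(\mathbf{x}) = \begin{bmatrix} \mathrm{cov}(\mathrm{x}_1, \mathrm{x}_1) & \cdots & \mathrm{cov}(\mathrm{x}_1, \mathrm{x}_l) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(\mathrm{x}_l, \mathrm{x}_1) & \cdots & \mathrm{cov}(\mathrm{x}_l, \mathrm{x}_l) \end{bmatrix}. \tag{2.32}$$

Another symbol that will be used to denote the covariance matrix is $\Sigma_x$. Similarly, the correlation 
matrix of a random vector $\mathbf{x}$ is defined as 

$$R_x := \mathbb{E}\left[\mathbf{x}\mathbf{x}^T\right]: \quad \text{correlation matrix}, \tag{2.33}$$

or 

$$R_x = \begin{bmatrix} \mathbb{E}[\mathrm{x}_1 \mathrm{x}_1] & \cdots & \mathbb{E}[\mathrm{x}_1 \mathrm{x}_l] \\ \vdots & \ddots & \vdots \\ \mathbb{E}[\mathrm{x}_l \mathrm{x}_1] & \cdots & \mathbb{E}[\mathrm{x}_l \mathrm{x}_l] \end{bmatrix} = \mathrm{Cov}(\mathbf{x}) + \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{x}^T]. \tag{2.34}$$

Often, subscripts are dropped in order to simplify notation, and the corresponding symbols, e.g., $r$, 
$\Sigma$, and $R$ are used instead (see footnote 1), unless it is necessary to avoid confusion when different random variables 
are involved. Both the covariance and correlation matrices have a very rich structure, which will be 
exploited in various parts of this book to lead to computational savings whenever they are present 
in calculations. For the time being, observe that both are symmetric and positive semidefinite. The 
symmetry, $\Sigma = \Sigma^T$, is readily deduced from the definition. An $l \times l$ symmetric matrix $A$ is called 
positive semidefinite if 

$$\mathbf{y}^T A \mathbf{y} \ge 0, \quad \forall \mathbf{y} \in \mathbb{R}^l. \tag{2.35}$$

If the inequality is a strict one, the matrix is said to be positive definite. For the covariance matrix, we 
have 

$$\mathbf{y}^T \mathbb{E}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^T\right] \mathbf{y} = \mathbb{E}\left[\left(\mathbf{y}^T (\mathbf{x} - \mathbb{E}[\mathbf{x}])\right)^2\right] \ge 0,$$

and the claim has been proved. 

1 Note that in the subsequent chapters, to avoid having bold letters in subscripts, which can be cumbersome when more than 
one vector variable is involved, the notation has been slightly relaxed. For example, $R_x$ is used in place of $R_{\mathbf{x}}$. For the sake of 
uniformity, the same applies to the rest of the variables corresponding to correlations, variances, and covariance matrices. 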
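
The symmetry and positive semidefiniteness of the covariance matrix can also be observed on a sample estimate. The sketch below replaces the expectation in Eq. (2.31) by an average over samples; the synthetic data generator, the function names, and the seed are assumptions made for this illustration only.

```python
import random


def draw_samples(n, seed=0):
    """Hypothetical data: x2 is a noisy scaled copy of x1, so the components are correlated."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 0.5 * x1 + rng.gauss(0.0, 0.5)
        samples.append((x1, x2))
    return samples


def sample_covariance(samples):
    """Sample estimate of Cov(x) = E[(x - E[x])(x - E[x])^T] from a list of vectors."""
    n, l = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(l)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
             for j in range(l)] for i in range(l)]


def quadratic_form(a, y):
    """Return y^T A y for a square matrix A given as a list of rows."""
    return sum(y[i] * a[i][j] * y[j]
               for i in range(len(y)) for j in range(len(y)))
```

For `cov = sample_covariance(draw_samples(2000))`, the off-diagonal entries coincide and `quadratic_form(cov, y)` is nonnegative (up to rounding) for any vector `y`, since it equals the sample average of the squared projections $(\mathbf{y}^T(\mathbf{x} - \mathbb{E}[\mathbf{x}]))^2$, mirroring the argument used in the proof above.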

Complex Random Variables 

A complex random variable, $\mathrm{z} \in \mathbb{C}$, is a sum 

$$\mathrm{z} = \mathrm{x} + j\mathrm{y}, \tag{2.36}$$

where $\mathrm{x}$, $\mathrm{y}$ are real random variables and $j := \sqrt{-1}$. Note that for complex random variables, the PDF 
cannot be defined as in Eq. (2.18), since inequalities of the form $\mathrm{x} + j\mathrm{y} \le x + jy$ have no meaning. When we write 
$p(z)$, we mean the joint PDF of the real and imaginary parts, expressed as 

$$p(z) := p(x, y). \tag{2.37}$$

For complex random variables, the notions of mean and covariance are defined as 

$$\mathbb{E}[\mathrm{z}] := \mathbb{E}[\mathrm{x}] + j\, \mathbb{E}[\mathrm{y}], \tag{2.38}$$

and 

$$\mathrm{cov}(\mathrm{z}_1, \mathrm{z}_2) := \mathbb{E}\left[(\mathrm{z}_1 - \mathbb{E}[\mathrm{z}_1])(\mathrm{z}_2 - \mathbb{E}[\mathrm{z}_2])^*\right], \tag{2.39}$$

where “$*$” denotes complex conjugation. The latter definition leads to the variance of a complex variable, 

$$\sigma_z^2 = \mathbb{E}\left[|\mathrm{z} - \mathbb{E}[\mathrm{z}]|^2\right] = \mathbb{E}\left[|\mathrm{z}|^2\right] - |\mathbb{E}[\mathrm{z}]|^2. \tag{2.40}$$

Similarly, for complex random vectors, $\mathbf{z} = \mathbf{x} + j\mathbf{y} \in \mathbb{C}^l$, we have 

$$p(\mathbf{z}) := p(x_1, \ldots, x_l, y_1, \ldots, y_l), \tag{2.41}$$

where $x_i$, $y_i$, $i = 1, 2, \ldots, l$, are the components of the involved real vectors, respectively. The covariance 
and correlation matrices are similarly defined, e.g., 

$$\mathrm{Cov}(\mathbf{z}) := \mathbb{E}\left[(\mathbf{z} - \mathbb{E}[\mathbf{z}])(\mathbf{z} - \mathbb{E}[\mathbf{z}])^H\right], \tag{2.42}$$

where “H” denotes the Hermitian (transposition and conjugation) operation. 

For the rest of the chapter, we are going to deal mainly with real random variables. Whenever 
needed, differences with the case of complex variables will be stated. 


2.2.5 TRANSFORMATION OF RANDOM VARIABLES 

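Let $\mathbf{x}$ and $\mathbf{y}$ be two random vectors, which are related via the vector transform 

$$\mathbf{y} = f(\mathbf{x}), \tag{2.43}$$

where $f: \mathbb{R}^l \to \mathbb{R}^l$ is an invertible transform. That is, given $\mathbf{y}$, $\mathbf{x} = f^{-1}(\mathbf{y})$ can be uniquely obtained. 

We are given the joint PDF, $p_{\mathbf{x}}(\mathbf{x})$, of $\mathbf{x}$ and the task is to obtain the joint PDF, $p_{\mathbf{y}}(\mathbf{y})$, of $\mathbf{y}$. 

The Jacobian matrix of the transformation is defined as 

$$J(\mathbf{y}; \mathbf{x}) := \frac{\partial (y_1, y_2, \ldots, y_l)}{\partial (x_1, x_2, \ldots, x_l)} := \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_l} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_l}{\partial x_1} & \cdots & \frac{\partial y_l}{\partial x_l} \end{bmatrix}. \tag{2.44}$$

Then, it can be shown (e.g., [6]) that 

$$p_{\mathbf{y}}(\mathbf{y}) = \left. \frac{p_{\mathbf{x}}(\mathbf{x})}{|\det(J(\mathbf{y}; \mathbf{x}))|} \right|_{\mathbf{x} = f^{-1}(\mathbf{y})}, \tag{2.45}$$

where $|\det(\cdot)|$ denotes the absolute value of the determinant of a matrix. For scalar real random variables, as 
in $\mathrm{y} = f(\mathrm{x})$, Eq. (2.45) simplifies to 

$$p_{\mathrm{y}}(y) = \left. \frac{p_{\mathrm{x}}(x)}{\left|\frac{dy}{dx}\right|} \right|_{x = f^{-1}(y)}. \tag{2.46}$$

The latter can be graphically understood from Fig. 2.1. The following two events have equal probabilities: 

$$P(x < \mathrm{x} \le x + \Delta x) = P(y + \Delta y \le \mathrm{y} < y), \quad \Delta x > 0, \ \Delta y < 0.$$

Hence, by the definition of a PDF we have 

$$p_{\mathrm{y}}(y)|\Delta y| = p_{\mathrm{x}}(x)|\Delta x|, \tag{2.47}$$

which leads to Eq. (2.46). 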

Example 2.1. Let us consider two random vectors that are related via the linear transform 

$$\mathbf{y} = A\mathbf{x}, \tag{2.48}$$

where $A$ is invertible. Compute the joint PDF of $\mathbf{y}$ in terms of $p_{\mathbf{x}}(\mathbf{x})$. 

The Jacobian of the transformation is easily computed and given by 

$$J(\mathbf{y}; \mathbf{x}) = \begin{bmatrix} a_{11} & \cdots & a_{1l} \\ a_{21} & \cdots & a_{2l} \\ \vdots & \ddots & \vdots \\ a_{l1} & \cdots & a_{ll} \end{bmatrix} = A.$$

Hence, 

$$p_{\mathbf{y}}(\mathbf{y}) = \frac{p_{\mathbf{x}}(A^{-1}\mathbf{y})}{|\det(A)|}. \tag{2.49}$$


FIGURE 2.1 

Note that by the definition of a PDF, $p_{\mathrm{y}}(y)|\Delta y| = p_{\mathrm{x}}(x)|\Delta x|$. 
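
Eq. (2.49) can be checked numerically in two dimensions. The sketch below assumes, as an illustration not taken from the text, that x follows a standard two-dimensional Gaussian with independent components; all function names and the matrix used in the usage note are hypothetical, and the 2 x 2 inverse and determinant are written out in closed form to keep the example self-contained.

```python
import math


def std_normal_2d(x):
    """Assumed p_x: standard Gaussian in two dimensions with independent components."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def det2(a):
    """Determinant of a 2x2 matrix given as a list of rows."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of an invertible 2x2 matrix."""
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d],
            [-a[1][0] / d, a[0][0] / d]]


def matvec(a, v):
    """Product of a 2x2 matrix with a 2-vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]


def pdf_y(y, a, pdf_x=std_normal_2d):
    """Eq. (2.49): p_y(y) = p_x(A^{-1} y) / |det(A)| for the transform y = A x."""
    return pdf_x(matvec(inv2(a), y)) / abs(det2(a))
```

As a consistency check under these assumptions, for `a = [[2.0, 1.0], [0.0, 1.5]]` the value `pdf_y(y, a)` should coincide with the zero-mean Gaussian density whose covariance matrix is $AA^T$, since a linear transform of a Gaussian vector is again Gaussian.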


2.3 EXAMPLES OF DISTRIBUTIONS 

In this section, some notable examples of distributions are provided. These are popular for modeling 
the random nature of variables met in a wide range of applications, and they will be used later in this 
book. 


2.3.1 DISCRETE VARIABLES 

The Bernoulli Distribution 

A random variable is said to be distributed according to a Bernoulli distribution if it is binary, $\mathcal{X} = \{0, 1\}$, with 

$$P(\mathrm{x} = 1) = p, \quad P(\mathrm{x} = 0) = 1 - p.$$

In a more compact way, we write $\mathrm{x} \sim \mathrm{Bern}(x|p)$, where 

$$P(x) = \mathrm{Bern}(x|p) := p^x (1 - p)^{1 - x}. \tag{2.50}$$

Its mean value is equal to 

$$\mathbb{E}[\mathrm{x}] = 1 \cdot p + 0 \cdot (1 - p) = p \tag{2.51}$$

and its variance is equal to 

$$\sigma_x^2 = (1 - p)^2 p + p^2 (1 - p) = p(1 - p). \tag{2.52}$$

The Binomial Distribution 

A random variable $\mathrm{x}$ is said to follow a binomial distribution with parameters $n$, $p$, and we write 
$\mathrm{x} \sim \mathrm{Bin}(x|n, p)$, if $\mathcal{X} = \{0, 1, \ldots, n\}$ and 

$$P(\mathrm{x} = k) := \mathrm{Bin}(k|n, p) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n, \tag{2.53}$$

where by definition 

$$\binom{n}{k} := \frac{n!}{(n - k)!\, k!}. \tag{2.54}$$

For example, this distribution models the number of times that heads occurs in $n$ successive trials, where 
$P(\text{Heads}) = p$. The binomial is a generalization of the Bernoulli distribution, which results if in 
Eq. (2.53) we set $n = 1$. The mean and variance of the binomial distribution are (Problem 2.1) 

$$\mathbb{E}[\mathrm{x}] = np \tag{2.55}$$

and 

$$\sigma_x^2 = np(1 - p). \tag{2.56}$$

Fig. 2.2A shows the probability $P(k)$ as a function of $k$ for $p = 0.4$ and $n = 9$. Fig. 2.2B shows the 
respective cumulative distribution. Observe that the latter has a staircase form, as is always the case for 
discrete variables. 
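
The binomial PMF of Eq. (2.53) and the moments in Eqs. (2.55) and (2.56) can be evaluated directly. The following sketch uses only the Python standard library; the function names are illustrative choices, and the values $p = 0.4$, $n = 9$ in the usage note are those quoted for Fig. 2.2.

```python
import math


def binomial_pmf(k, n, p):
    """Bin(k | n, p) = C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(x <= k): a staircase function obtained by accumulating the PMF."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


def mean_and_variance(n, p):
    """Mean and variance computed directly from the PMF by summation."""
    mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    var = sum((k - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
    return mean, var
```

For `n = 9` and `p = 0.4`, `mean_and_variance(9, 0.4)` should agree with $np = 3.6$ and $np(1 - p) = 2.16$ up to rounding, and `binomial_pmf(1, 1, p)` reduces to the Bernoulli probability $p$.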




FIGURE 2.2 


(A) The probability mass function (PMF) for the binomial distribution for p = 0.4 and n = 9. (B) The respective 
cumulative distribution function (CDF). Since the random variable is discrete, the CDF has a staircase-like graph. 

The Multinomial Distribution 

This is a generalization of the binomial distribution if the outcome of each experiment is not binary 
but can take one out of $K$ possible values. For example, instead of tossing a coin, a die with $K$ sides 
is thrown. Each one of the possible $K$ outcomes has probability $P_1, P_2, \ldots, P_K$, respectively, to occur, 
and we denote 

$$\mathbf{P} = [P_1, P_2, \ldots, P_K]^T.$$

After $n$ experiments, assume that $x_1, x_2, \ldots, x_K$ times sides $x = 1, x = 2, \ldots, x = K$ occurred, respectively. 
We say that the random (discrete) vector, 

$$\mathbf{x} = [\mathrm{x}_1, \mathrm{x}_2, \ldots, \mathrm{x}_K]^T, \tag{2.57}$$

follows a multinomial distribution, $\mathbf{x} \sim \mathrm{Mult}(\mathbf{x}|n, \mathbf{P})$, if 

$$P(\mathbf{x}) = \mathrm{Mult}(\mathbf{x}|n, \mathbf{P}) := \binom{n}{x_1, x_2, \ldots, x_K} \prod_{k=1}^{K} P_k^{x_k}, \tag{2.58}$$

where 

$$\binom{n}{x_1, x_2, \ldots, x_K} := \frac{n!}{x_1!\, x_2! \cdots x_K!}.$$

Note that the variables $x_1, \ldots, x_K$ are subject to the constraint 

$$\sum_{k=1}^{K} x_k = n,$$

and also 

$$\sum_{k=1}^{K} P_k = 1.$$

The mean value, the variances, and the covariances are given by 

$$\mathbb{E}[\mathbf{x}] = n\mathbf{P}, \quad \sigma_{x_k}^2 = nP_k(1 - P_k), \ k = 1, 2, \ldots, K, \quad \mathrm{cov}(\mathrm{x}_i, \mathrm{x}_j) = -nP_iP_j, \ i \ne j. \tag{2.59}$$

The special case of the multinomial, where only one experiment, $n = 1$, is performed, is known as the 
categorical distribution. The latter can be considered as the generalization of the Bernoulli distribution. 
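
The moments in Eq. (2.59) can be confirmed by brute-force enumeration for small $n$ and $K$. The sketch below is illustrative: the function names and the probability vector in the usage note are assumptions, and enumerating all count vectors is only practical for small problems.

```python
import math
from itertools import product


def multinomial_pmf(counts, probs):
    """Mult(x | n, P) for the count vector x = counts, with n = sum(counts)."""
    coeff = math.factorial(sum(counts))
    for x_k in counts:
        coeff //= math.factorial(x_k)  # each partial quotient stays an integer
    value = float(coeff)
    for x_k, p_k in zip(counts, probs):
        value *= p_k ** x_k
    return value


def all_count_vectors(n, k):
    """Yield every tuple of k nonnegative integer counts that sums to n."""
    for head in product(range(n + 1), repeat=k - 1):
        if sum(head) <= n:
            yield head + (n - sum(head),)
```

With the assumed values `n = 4` and `probs = [0.2, 0.3, 0.5]`, summing `multinomial_pmf(v, probs)` over `all_count_vectors(4, 3)` should give one, the mean of the first count should be $nP_1 = 0.8$, and the covariance of the first two counts should be $-nP_1P_2 = -0.24$, in line with Eq. (2.59).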



FIGURE 2.3 

The PDF of a uniform distribution U(a, b). 


2.3.2 CONTINUOUS VARIABLES 

The Uniform Distribution 

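A random variable $\mathrm{x}$ is said to follow a uniform distribution in an interval $[a, b]$, and we write 
$\mathrm{x} \sim \mathcal{U}(a, b)$, with $a > -\infty$ and $b < +\infty$, if 

$$p(x) = \begin{cases} \frac{1}{b - a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise}. \end{cases} \tag{2.60}$$

Fig. 2.3 shows the respective graph. The mean value is equal to 

$$\mathbb{E}[\mathrm{x}] = \frac{a + b}{2} \tag{2.61}$$

and the variance is given by (Problem 2.2) 

$$\sigma_x^2 = \frac{1}{12}(b - a)^2. \tag{2.62}$$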


The Gaussian Distribution 

The Gaussian or normal distribution is one of the most widely used distributions in all scientific 
disciplines. We say that a random variable $\mathrm{x}$ is Gaussian or normal with parameters $\mu$ and $\sigma^2$, and we 
write $\mathrm{x} \sim \mathcal{N}(\mu, \sigma^2)$ or $\mathcal{N}(x|\mu, \sigma^2)$, if 

$$p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \tag{2.63}$$

It can be shown that the corresponding mean and variance are 

$$\mathbb{E}[\mathrm{x}] = \mu \quad \text{and} \quad \sigma_x^2 = \sigma^2. \tag{2.64}$$


FIGURE 2.4 

The graphs of two Gaussian PDFs for μ = 1 and σ² = 0.1 (red) and σ² = 0.01 (gray). 


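Indeed, by the definition of the mean value, we have 

$$\mathbb{E}[\mathrm{x}] = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{+\infty} x \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{+\infty} (y + \mu) \exp\left(-\frac{y^2}{2\sigma^2}\right) dy, \tag{2.65}$$

where the change of variable $y := x - \mu$ has been used. Due to the symmetry of the exponential function, performing the integration involving $y$ gives zero 
and the only surviving term is due to $\mu$. Taking into account that a PDF integrates to one, we obtain 
the result. 

To derive the variance, from the definition of the Gaussian PDF, we have 

$$\int_{-\infty}^{+\infty} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sqrt{2\pi}\sigma. \tag{2.66}$$

Taking the derivative of both sides with respect to $\sigma$, we obtain 

$$\int_{-\infty}^{+\infty} \frac{(x - \mu)^2}{\sigma^3} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sqrt{2\pi} \tag{2.67}$$

or 

$$\sigma_x^2 := \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{+\infty} (x - \mu)^2 \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sigma^2, \tag{2.68}$$

which proves the claim. 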
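
The results in Eq. (2.64) can also be confirmed by numerical integration. The sketch below approximates the integrals over a truncated range of ten standard deviations on each side of the mean with the midpoint rule; the truncation, the number of grid points, and the function names are assumptions made for this illustration.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Gaussian PDF with mean mu and variance sigma2, as in Eq. (2.63)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def moments(mu, sigma2, half_width=10.0, n=20_000):
    """Midpoint-rule estimates of total mass, mean, and variance.

    The integration range is mu +/- half_width standard deviations, so the
    neglected tails are assumed to be negligible.
    """
    sigma = math.sqrt(sigma2)
    a = mu - half_width * sigma
    h = 2.0 * half_width * sigma / n
    xs = [a + (k + 0.5) * h for k in range(n)]
    mass = h * sum(gaussian_pdf(x, mu, sigma2) for x in xs)
    mean = h * sum(x * gaussian_pdf(x, mu, sigma2) for x in xs)
    var = h * sum((x - mean) ** 2 * gaussian_pdf(x, mu, sigma2) for x in xs)
    return mass, mean, var
```

For example, `moments(1.0, 0.1)` should return values close to 1, 1, and 0.1, respectively. Comparing `gaussian_pdf(1.0, 1.0, 0.01)` with `gaussian_pdf(1.0, 1.0, 0.1)` shows that the smaller variance gives the higher peak at the mean.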

Fig. 2.4 shows the graph for two cases, $\mathcal{N}(x|1, 0.1)$ and $\mathcal{N}(x|1, 0.01)$. Both curves are symmetrically 
placed around the mean value $\mu = 1$. Observe that the smaller the variance is, the sharper around 
the mean value the PDF becomes. 


FIGURE 2.5 

The graphs of two-dimensional Gaussian PDFs for μ = 0 and different covariance matrices. (A) The covariance 
matrix is diagonal with equal elements along the diagonal. (B) The corresponding covariance matrix is nondiagonal. 


The generalization of the Gaussian to vector variables, $\mathbf{x} \in \mathbb{R}^l$, results in the so-called multivariate 
Gaussian or normal distribution, $\mathbf{x} \sim \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$ with parameters $\boldsymbol{\mu}$ and $\Sigma$, which is defined as 

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right): \quad \text{Gaussian PDF}, \tag{2.69}$$

where $|\cdot|$ denotes the determinant of a matrix. It can be shown (Problem 2.3) that the respective mean 
values and the covariance matrix are given by 

$$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu} \quad \text{and} \quad \Sigma_x = \Sigma. \tag{2.70}$$

Fig. 2.5 shows the two-dimensional normal PDF for two cases. Both share the same mean vector, 
$\boldsymbol{\mu} = \mathbf{0}$, but they have different covariance matrices, 

$$\Sigma_1 = \begin{bmatrix} 0.1 & 0.0 \\ 0.0 & 0.1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.1 & 0.01 \\ 0.01 & 0.2 \end{bmatrix}. \tag{2.71}$$

Fig. 2.6 shows the corresponding isovalue contours for equal probability density values. In 
Fig. 2.6A, the contours are circles, corresponding to the symmetric PDF in Fig. 2.5A with covariance 
matrix $\Sigma_1$. The one shown in Fig. 2.6B corresponds to the PDF in Fig. 2.5B associated with $\Sigma_2$. 
Observe that, in general, the isovalue curves are ellipses/hyperellipsoids. They are centered at the mean 
vector, and the orientation of the major axis as well as their exact shape is controlled by the eigenstructure 
of the associated covariance matrix. Indeed, all points $\mathbf{x} \in \mathbb{R}^l$, which score the same probability density 
value, obey 

$$(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) = \text{constant} = c. \tag{2.72}$$


FIGURE 2.6 

The isovalue contours for the two Gaussians of Fig. 2.5. The contours for the Gaussian in Fig. 2.5A are circles, 
while those corresponding to Fig. 2.5B are ellipses. The major and minor axes of the ellipse are determined by the 
eigenvectors/eigenvalues of the respective covariance matrix, and they are proportional to $\sqrt{\lambda_1 c}$ and $\sqrt{\lambda_2 c}$, respectively. 
In the figure, they are shown for the case of c = 1. For the case of the diagonal matrix, with equal elements 
along the diagonal, all eigenvalues are equal, and the ellipse becomes a circle. 


We know that the covariance matrix, besides being positive definite, is also symmetric, $\Sigma = \Sigma^T$. Thus, its eigenvalues are real and the corresponding eigenvectors can be chosen to form an orthonormal basis (Appendix A.2), which leads to its diagonalization,

$$\Sigma = U \Lambda U^T, \tag{2.73}$$

with

$$U := [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_l], \tag{2.74}$$

where $\boldsymbol{u}_i$, $i = 1, 2, \ldots, l$, are the orthonormal eigenvectors, and

$$\Lambda := \mathrm{diag}\{\lambda_1, \ldots, \lambda_l\} \tag{2.75}$$

comprise the respective eigenvalues. We assume that $\Sigma$ is invertible, hence all eigenvalues are positive (being positive definite it has positive eigenvalues, Appendix A.2). Due to the orthonormality of the eigenvectors, matrix $U$ is orthogonal as expressed in $UU^T = U^TU = I$. Thus, Eq. (2.72) can now be written as

$$\boldsymbol{y}^T \Lambda^{-1} \boldsymbol{y} = c, \tag{2.76}$$

where we have used the linear transformation

$$\boldsymbol{y} := U^T(\boldsymbol{x} - \boldsymbol{\mu}), \tag{2.77}$$

which corresponds to a rotation of the axes by $U$ and a translation of the origin to $\boldsymbol{\mu}$. Eq. (2.76) can be written as

$$\frac{y_1^2}{\lambda_1} + \cdots + \frac{y_l^2}{\lambda_l} = c, \tag{2.78}$$

where it can be readily observed that it is an equation describing a (hyper)ellipsoid in $\mathbb{R}^l$. From Eq. (2.77), it is easily seen that it is centered at $\boldsymbol{\mu}$ and that the major axes of the ellipsoid are parallel to $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_l$ (plug in place of $\boldsymbol{x}$ the standard basis vectors, $[1, 0, \ldots, 0]^T$, etc.). The sizes of the respective axes are controlled by the corresponding eigenvalues. This is shown in Fig. 2.6B. For the special case of a diagonal covariance with equal elements across the diagonal, all eigenvalues are equal to the value of the common diagonal element and the ellipsoid becomes a (hyper)sphere (circle) as shown in Fig. 2.6A.
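The change of variables above can be checked numerically. The following is a minimal sketch for the two-dimensional case only, where the eigen-decomposition of a symmetric matrix has a closed form; the covariance entries, the mean, and the test point are arbitrary values chosen for illustration, and the function names are not from the text.

```python
import math

def eig_sym_2x2(s11, s12, s22):
    """Eigen-decomposition of the symmetric matrix [[s11, s12], [s12, s22]].

    Returns (lam1, lam2, u1, u2) with lam1 >= lam2 and orthonormal u1, u2.
    """
    # Rotation angle that diagonalizes a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    c, s = math.cos(theta), math.sin(theta)
    u1, u2 = (c, s), (-s, c)
    # Rayleigh quotients u^T Sigma u give the eigenvalues.
    lam1 = s11 * c * c + 2.0 * s12 * c * s + s22 * s * s
    lam2 = s11 * s * s - 2.0 * s12 * c * s + s22 * c * c
    return lam1, lam2, u1, u2

def mahalanobis_sq(x, mu, s11, s12, s22):
    """(x - mu)^T Sigma^{-1} (x - mu), using the explicit 2x2 inverse."""
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    det = s11 * s22 - s12 * s12
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det

def rotated_form(x, mu, s11, s12, s22):
    """y_1^2 / lam_1 + y_2^2 / lam_2 with y = U^T (x - mu), as in Eq. (2.78)."""
    lam1, lam2, u1, u2 = eig_sym_2x2(s11, s12, s22)
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    y1 = u1[0] * d1 + u1[1] * d2
    y2 = u2[0] * d1 + u2[1] * d2
    return y1 * y1 / lam1 + y2 * y2 / lam2

SIGMA = (2.0, 0.8, 1.0)   # s11, s12, s22: an arbitrary positive definite example
MU = (1.0, -1.0)
X = (2.5, 0.3)
lhs = mahalanobis_sq(X, MU, *SIGMA)   # quadratic form in the original coordinates
rhs = rotated_form(X, MU, *SIGMA)     # same quantity in the rotated coordinates
```

Both `lhs` and `rhs` evaluate the same constant $c$ for the point `X`, so the point lies on the same ellipse in either coordinate system; the square roots of the two eigenvalues set the relative lengths of its axes.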

The Gaussian PDF has a number of nice properties, which we are going to discover as we move on in this book. For the time being, note that if the covariance matrix is diagonal,

$$\Sigma = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_l^2\},$$

that is, when the covariances between all pairs of distinct elements are zero, $\mathrm{cov}(\mathrm{x}_i, \mathrm{x}_j) = 0$, $i \neq j$, $i, j = 1, 2, \ldots, l$, then the random variables comprising $\boldsymbol{x}$ are statistically independent. In general, this implication does not hold. Uncorrelated variables are not necessarily independent; independence is a much stronger condition. It does hold, however, if they follow a multivariate Gaussian. Indeed, if the covariance matrix is diagonal, then the multivariate Gaussian is written as

$$p(\boldsymbol{x}) = \prod_{i=1}^{l} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right). \tag{2.79}$$

In other words,

$$p(\boldsymbol{x}) = \prod_{i=1}^{l} p(x_i), \tag{2.80}$$

which is the condition for statistical independence.
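To see why uncorrelatedness alone does not suffice outside the Gaussian case, consider a small non-Gaussian counterexample. The sketch below uses a toy pair of discrete variables that is not from the text: x is uniform on {-1, 0, 1} and y = x². Exact rational arithmetic avoids any rounding doubts.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy variable: x is uniform on {-1, 0, 1} and y is fully determined by x.
X_VALUES = (-1, 0, 1)
P_X = {x: Fraction(1, 3) for x in X_VALUES}

def joint_pmf():
    """Joint PMF of (x, y) with y = x**2, stored as {(x, y): probability}."""
    return {(x, x * x): p for x, p in P_X.items()}

def expectation(pmf, g):
    """E[g(x, y)] under a joint PMF."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

pmf = joint_pmf()
mean_x = expectation(pmf, lambda x, y: x)
mean_y = expectation(pmf, lambda x, y: y)
cov_xy = expectation(pmf, lambda x, y: (x - mean_x) * (y - mean_y))

# Marginal of y and an explicit check of the factorization p(x, y) = p(x) p(y).
p_y = {}
for (x, y), p in pmf.items():
    p_y[y] = p_y.get(y, Fraction(0)) + p
factorizes = all(
    pmf.get((x, y), Fraction(0)) == P_X[x] * p_y[y]
    for x, y in product(X_VALUES, p_y)
)
```

Here `cov_xy` is exactly zero while `factorizes` is false, so the pair is uncorrelated but clearly dependent.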


The Central Limit Theorem 

This is one of the most fundamental theorems in probability theory and statistics and it partly explains the popularity of the Gaussian distribution. Consider $N$ mutually independent random variables, each following its own distribution with mean values $\mu_i$ and variances $\sigma_i^2$, $i = 1, 2, \ldots, N$. Define a new random variable as their sum,

$$\mathrm{x} = \sum_{i=1}^{N} \mathrm{x}_i. \tag{2.81}$$

Then the mean and variance of the new variable are given by

$$\mu = \sum_{i=1}^{N} \mu_i \quad \text{and} \quad \sigma^2 = \sum_{i=1}^{N} \sigma_i^2. \tag{2.82}$$

It can be shown (e.g., [4,6]) that as $N \to \infty$ the distribution of the normalized variable

$$\mathrm{z} = \frac{\mathrm{x} - \mu}{\sigma} \tag{2.83}$$

tends to the standard normal distribution, and for the corresponding PDF we have

$$p(z) \xrightarrow[N \to \infty]{} \mathcal{N}(z|0, 1). \tag{2.84}$$

In practice, even summing up a relatively small number, $N$, of random variables, one can obtain a good approximation to a Gaussian. For example, if the individual PDFs are smooth enough and each random variable is independent and identically distributed (i.i.d.), a number $N$ between 5 and 10 can be sufficient. The term i.i.d. will be used a lot in this book. The term implies that successive samples of a random variable are drawn independently from the same distribution that describes the respective variable.
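The claim that a handful of terms already gives a good Gaussian approximation can be probed with a small simulation. This is only a sketch under assumptions of our own choosing: the summands are i.i.d. uniform on (0, 1), N = 8, and the trial count and seed are arbitrary.

```python
import math
import random

def normalized_sum(rng, n_terms):
    """Draw z = (x - mu) / sigma, where x is the sum of n_terms i.i.d. uniform(0, 1) variables."""
    total = sum(rng.random() for _ in range(n_terms))
    mu = n_terms * 0.5                     # sum of the individual means, Eq. (2.82)
    sigma = math.sqrt(n_terms / 12.0)      # square root of the sum of the individual variances
    return (total - mu) / sigma

def summarize(n_terms=8, n_trials=20000, seed=0):
    """Return the sample mean, sample variance, and fraction of draws with |z| < 1."""
    rng = random.Random(seed)
    zs = [normalized_sum(rng, n_terms) for _ in range(n_trials)]
    mean = sum(zs) / n_trials
    var = sum((z - mean) ** 2 for z in zs) / n_trials
    inside = sum(1 for z in zs if abs(z) < 1.0) / n_trials
    return mean, var, inside

mean_z, var_z, frac_within_one = summarize()
# For a standard normal, P(|z| < 1) = erf(1 / sqrt(2)), roughly 0.683.
gaussian_reference = math.erf(1.0 / math.sqrt(2.0))
```

Comparing `frac_within_one` with `gaussian_reference` gives a rough, single-number indication of how close the normalized sum is to the standard normal; a histogram of the draws would give a fuller picture.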

The Exponential Distribution

We say that a random variable follows an exponential distribution with parameter $\lambda > 0$, if

$$p(x) = \begin{cases} \lambda \exp(-\lambda x), & \text{if } x \geq 0, \\ 0, & \text{otherwise.} \end{cases} \tag{2.85}$$

The distribution has been used, for example, to model the time between arrivals of telephone calls or of a bus at a bus stop. The mean and variance can be easily computed by following simple integration rules, and they are

$$\mathbb{E}[\mathrm{x}] = \frac{1}{\lambda}, \quad \sigma_x^2 = \frac{1}{\lambda^2}. \tag{2.86}$$


The Beta Distribution 

We say that a random variable, $\mathrm{x} \in [0, 1]$, follows a beta distribution with positive parameters, $a$, $b$, and we write $\mathrm{x} \sim \mathrm{Beta}(x|a, b)$, if

$$p(x) = \begin{cases} \dfrac{1}{B(a, b)}\, x^{a-1}(1 - x)^{b-1}, & \text{if } 0 \leq x \leq 1, \\ 0, & \text{otherwise,} \end{cases} \tag{2.87}$$

where $B(a, b)$ is the beta function, defined as

$$B(a, b) := \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx. \tag{2.88}$$

The mean and variance of the beta distribution are given by (Problem 2.4)

$$\mathbb{E}[\mathrm{x}] = \frac{a}{a + b}, \quad \sigma_x^2 = \frac{ab}{(a + b)^2(a + b + 1)}. \tag{2.89}$$

FIGURE 2.7

The graphs of the PDFs of the beta distribution for different values of the parameters. (A) The dotted line corresponds to $a = 1$, $b = 1$, the gray line to $a = 0.5$, $b = 0.5$, and the red one to $a = 3$, $b = 3$. (B) The gray line corresponds to $a = 2$, $b = 3$, and the red one to $a = 8$, $b = 4$. For values $a = b$, the shape is symmetric around $1/2$. For $a < 1$, $b < 1$, it is convex. For $a > 1$, $b > 1$, it is zero at $x = 0$ and $x = 1$. For $a = 1 = b$, it becomes the uniform distribution. If $a < 1$, $p(x) \to \infty$ as $x \to 0$, and if $b < 1$, $p(x) \to \infty$ as $x \to 1$.

Moreover, it can be shown (Problem 2.5) that

$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}, \tag{2.90}$$

where $\Gamma$ is the gamma function defined as

$$\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx. \tag{2.91}$$

The beta distribution is very flexible and one can achieve various shapes by changing the parameters $a$, $b$. For example, if $a = b = 1$, the uniform distribution results. If $a = b$, the PDF has a symmetric graph around $1/2$. If $a > 1$, $b > 1$, then $p(x) \to 0$ both at $x = 0$ and $x = 1$. If $a < 1$ and $b < 1$, it is convex with a unique minimum. If $a < 1$, it tends to $\infty$ as $x \to 0$, and if $b < 1$, it tends to $\infty$ for $x \to 1$. Figs. 2.7A and B show the graph of the beta distribution for different values of the parameters.
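The relations among the beta function, the gamma function, and the moments can be cross-checked numerically. The sketch below assumes the parameter pair a = 2, b = 3 (our choice, so that the integrand stays bounded) and uses a plain midpoint rule; the helper names are illustrative only.

```python
import math

def beta_function(a, b):
    """B(a, b) through the gamma function, Eq. (2.90)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(x, a, b):
    """Beta PDF of Eq. (2.87); zero outside [0, 1]."""
    if x < 0.0 or x > 1.0:
        return 0.0
    return x ** (a - 1.0) * (1.0 - x) ** (b - 1.0) / beta_function(a, b)

def midpoint_integral(f, n=20000):
    """Midpoint-rule approximation of the integral of f over [0, 1].

    The midpoints never touch 0 or 1, so the endpoints are not evaluated.
    """
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

A_PAR, B_PAR = 2.0, 3.0   # arbitrary example with a, b > 1
# Direct numerical evaluation of the defining integral, Eq. (2.88).
b_numeric = midpoint_integral(lambda x: x ** (A_PAR - 1.0) * (1.0 - x) ** (B_PAR - 1.0))
mean_numeric = midpoint_integral(lambda x: x * beta_pdf(x, A_PAR, B_PAR))
var_numeric = midpoint_integral(lambda x: (x - mean_numeric) ** 2 * beta_pdf(x, A_PAR, B_PAR))
mean_closed = A_PAR / (A_PAR + B_PAR)                                        # Eq. (2.89)
var_closed = A_PAR * B_PAR / ((A_PAR + B_PAR) ** 2 * (A_PAR + B_PAR + 1.0))  # Eq. (2.89)
```

For parameters below one the integrand is unbounded at the endpoints, and a simple midpoint rule would converge slowly; a dedicated quadrature routine would be a better choice there.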


The Gamma Distribution 

A random variable follows the gamma distribution with positive parameters $a$, $b$, and we write $\mathrm{x} \sim \mathrm{Gamma}(x|a, b)$, if

$$p(x) = \begin{cases} \dfrac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}, & x > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{2.92}$$

FIGURE 2.8

The PDF of the gamma distribution takes different shapes for the various values of the following parameters: $a = 0.5$, $b = 1$ (full line gray), $a = 2$, $b = 0.5$ (red), $a = 1$, $b = 2$ (dotted).

The mean and variance are given by

$$\mathbb{E}[\mathrm{x}] = \frac{a}{b}, \quad \sigma_x^2 = \frac{a}{b^2}. \tag{2.93}$$

The gamma distribution also takes various shapes by varying the parameters. For $a < 1$, it is strictly decreasing and $p(x) \to \infty$ as $x \to 0$ and $p(x) \to 0$ as $x \to \infty$. Fig. 2.8 shows the resulting graphs for various values of the parameters.

Remarks 2.1.

• Setting in the gamma distribution $a$ to be an integer (usually $a = 2$), the Erlang distribution results. This distribution is used to model waiting times in queueing systems.

• The chi-squared is also a special case of the gamma distribution, and it is obtained if we set $b = 1/2$ and $a = \nu/2$. The chi-squared distribution results if we sum up $\nu$ squared independent standard normal variables.
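The chi-squared remark can be illustrated by simulation. The following sketch assumes ν = 4 and an arbitrary seed and sample size; it compares sums of squared standard normal draws with the gamma moments of Eq. (2.93) and with direct gamma draws. Note that Python's `random.gammavariate` is parameterized by shape and scale, so the scale is 1/b in the notation used here.

```python
import random

def chi_squared_sample(rng, nu):
    """Sum of nu squared independent standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

def sample_moments(samples):
    """Sample mean and (biased) sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

NU = 4
rng = random.Random(1)
chi2 = [chi_squared_sample(rng, NU) for _ in range(20000)]
chi2_mean, chi2_var = sample_moments(chi2)

# Gamma parameters of Remarks 2.1: a = nu / 2, b = 1 / 2, so Eq. (2.93) gives
# mean a / b = nu and variance a / b**2 = 2 * nu.
a, b = NU / 2.0, 0.5
gamma_mean, gamma_var = a / b, a / b ** 2

# random.gammavariate takes a shape and a *scale*, so the scale is 1 / b here.
direct = [rng.gammavariate(a, 1.0 / b) for _ in range(20000)]
direct_mean, direct_var = sample_moments(direct)
```

Up to sampling fluctuations, the two sets of sample moments should agree with each other and with `gamma_mean` and `gamma_var`.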

The Dirichlet Distribution 

The Dirichlet distribution can be considered as the multivariate generalization of the beta distribution. Let $\boldsymbol{x} = [\mathrm{x}_1, \ldots, \mathrm{x}_K]^T$ be a random vector, with components such that

$$0 \leq \mathrm{x}_k \leq 1, \quad k = 1, 2, \ldots, K, \quad \text{and} \quad \sum_{k=1}^{K} \mathrm{x}_k = 1. \tag{2.94}$$

In other words, the random variables lie on a $(K - 1)$-dimensional simplex, Fig. 2.9. We say that the random vector $\boldsymbol{x}$ follows a Dirichlet distribution with (positive) parameters $\boldsymbol{a} = [a_1, \ldots, a_K]^T$, and we write $\boldsymbol{x} \sim \mathrm{Dir}(\boldsymbol{x}|\boldsymbol{a})$, if

$$p(\boldsymbol{x}) = \mathrm{Dir}(\boldsymbol{x}|\boldsymbol{a}) := \frac{\Gamma(\bar{a})}{\Gamma(a_1) \cdots \Gamma(a_K)} \prod_{k=1}^{K} x_k^{a_k - 1}, \tag{2.95}$$

where

$$\bar{a} = \sum_{k=1}^{K} a_k. \tag{2.96}$$

FIGURE 2.9

The two-dimensional simplex in $\mathbb{R}^3$.

FIGURE 2.10

The Dirichlet distribution over the two-dimensional simplex for (A) (0.1, 0.1, 0.1), (B) (1, 1, 1), and (C) (10, 10, 10).

The mean, variance, and covariances of the involved random variables are given by (Problem 2.7)

$$\mathbb{E}[\boldsymbol{x}] = \frac{1}{\bar{a}}\boldsymbol{a}, \quad \sigma_k^2 = \frac{a_k(\bar{a} - a_k)}{\bar{a}^2(\bar{a} + 1)}, \quad \mathrm{cov}(\mathrm{x}_i, \mathrm{x}_j) = -\frac{a_i a_j}{\bar{a}^2(\bar{a} + 1)}, \quad i \neq j. \tag{2.97}$$

Fig. 2.10 shows the graph of the Dirichlet distribution for different values of the parameters, over the respective two-dimensional simplex.
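The moment formulas of Eq. (2.97) can be compared against samples. The sketch below relies on a standard construction that is not described in the text: a Dirichlet vector is obtained by normalizing independent gamma draws with shapes $a_k$ and a common scale. The parameter vector, seed, and sample size are arbitrary choices.

```python
import random

def dirichlet_sample(rng, a):
    """One draw from Dir(x | a) by normalizing independent Gamma(a_k, 1) draws."""
    g = [rng.gammavariate(a_k, 1.0) for a_k in a]
    total = sum(g)
    return [g_k / total for g_k in g]

def dirichlet_moments(a):
    """Closed-form mean vector and covariance matrix from Eq. (2.97)."""
    a_bar = sum(a)
    mean = [a_k / a_bar for a_k in a]
    denom = a_bar ** 2 * (a_bar + 1.0)
    cov = [[(a_i * (a_bar - a_i) if i == j else -a_i * a_j) / denom
            for j, a_j in enumerate(a)] for i, a_i in enumerate(a)]
    return mean, cov

def sample_moments(samples):
    """Sample mean vector and (biased) sample covariance matrix."""
    n, k = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(k)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
            for j in range(k)] for i in range(k)]
    return mean, cov

A_PARAMS = [2.0, 3.0, 5.0]    # arbitrary positive parameters for illustration
rng = random.Random(2)
draws = [dirichlet_sample(rng, A_PARAMS) for _ in range(20000)]
mean_hat, cov_hat = sample_moments(draws)
mean_true, cov_true = dirichlet_moments(A_PARAMS)
```

Every draw lies on the simplex of Eq. (2.94) by construction, and the off-diagonal covariances come out negative, reflecting the sum-to-one constraint.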


2.4 STOCHASTIC PROCESSES 

The notion of a random variable has been introduced to describe the result of a random experiment whose outcome is a single value, such as heads or tails in a coin tossing experiment, or a value between one and six when throwing the die in a backgammon game.

In this section, the notion of a stochastic process is introduced to describe random experiments where the outcome of each experiment is a function or a sequence; in other words, the outcome of each experiment is an infinite number of values. In this book, we are only going to be concerned with stochastic processes associated with sequences. Thus, the result of a random experiment is a sequence, $u_n$ (or sometimes denoted as $u(n)$), $n \in \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers. Usually, $n$ is interpreted as a time index, and $u_n$ is called a time series, or in signal processing jargon, a discrete-time signal. In contrast, if the outcome is a function, $u(t)$, it is called a continuous-time signal. We are going to adopt the time interpretation of the free variable $n$ for the rest of the chapter, without harming generality.

When discussing random variables, we used the notation $\mathrm{x}$ to denote the random variable, which assumes a value, $x$, from the sample space once an experiment is performed. Similarly, we are going to use $u_n$ to denote the specific sequence resulting from a single experiment and the roman font, $\mathrm{u}_n$, to denote the corresponding discrete-time random process, that is, the rule that assigns a specific sequence as the outcome of an experiment. A stochastic process can be considered as a family or ensemble of sequences. The individual sequences are known as sample sequences or simply as realizations.

For our notational convention, in general, we are going to reserve different symbols for processes and random variables. We have already used the symbol $\mathrm{u}$ and not $\mathrm{x}$; this is only for pedagogical reasons, just to make sure that the reader readily recognizes when the focus is on random variables and when it is on random processes. In signal processing jargon, a stochastic process is also known as a random signal. Fig. 2.11 illustrates the fact that the outcome of an experiment involving a stochastic process is a sequence of values.

Note that fixing the time to a specific value, $n = n_0$, makes $\mathrm{u}_{n_0}$ a random variable. Indeed, for each random experiment we perform, a single value results at time instant $n_0$. From this perspective, a random process can be considered the collection of infinitely many random variables, $\{\mathrm{u}_n, n \in \mathbb{Z}\}$. So, is there a need to study a stochastic process separate from random variables/vectors? The answer is yes, and the reason is that we are going to allow certain time dependencies among the random variables, corresponding to different time instants, and study the respective effect on the time evolution of the random process. Stochastic processes will be considered in Chapter 5, where the underlying time dependencies will be exploited for computational simplifications, and in Chapter 13 in the context of Gaussian processes.

FIGURE 2.11

The outcome of each experiment, associated with a discrete-time stochastic process, is a sequence of values. For each one of the realizations, the corresponding values obtained at any instant (e.g., $n$ or $m$) comprise the outcomes of a corresponding random variable, $\mathrm{u}_n$ or $\mathrm{u}_m$, respectively.


2.4.1 FIRST- AND SECOND-ORDER STATISTICS 

For a stochastic process to be fully described, one must know the joint PDFs (PMFs for discrete-valued random variables)

$$p(u_n, u_m, \ldots, u_r; n, m, \ldots, r), \tag{2.98}$$

for all possible combinations of random variables, $\mathrm{u}_n, \mathrm{u}_m, \ldots, \mathrm{u}_r$. Note that, in order to emphasize it, we have explicitly denoted the dependence of the joint PDFs on the involved time instants. However, from now on, this will be suppressed for notational convenience. Most often, in practice, and certainly in this book, the emphasis is on computing first- and second-order statistics only, based on $p(u_n)$ and $p(u_n, u_m)$. To this end, the following quantities are of particular interest.

Mean at time $n$:

$$\mu_n := \mathbb{E}[\mathrm{u}_n] = \int_{-\infty}^{+\infty} u_n\, p(u_n)\,du_n. \tag{2.99}$$

Autocovariance at time instants $n$, $m$:

$$\mathrm{cov}(n, m) := \mathbb{E}\big[(\mathrm{u}_n - \mathbb{E}[\mathrm{u}_n])(\mathrm{u}_m - \mathbb{E}[\mathrm{u}_m])\big]. \tag{2.100}$$

Autocorrelation at time instants $n$, $m$:

$$r(n, m) := \mathbb{E}[\mathrm{u}_n \mathrm{u}_m]. \tag{2.101}$$

Note that for notational simplicity, subscripts have been dropped from the respective symbols, e.g., $r(n, m)$ is used instead of the more formal notation, $r_u(n, m)$. We refer to these mean values as ensemble averages to stress that they convey statistical information over the ensemble of sequences that comprise the process.

The respective definitions for complex stochastic processes are

$$\mathrm{cov}(n, m) = \mathbb{E}\big[(\mathrm{u}_n - \mathbb{E}[\mathrm{u}_n])(\mathrm{u}_m - \mathbb{E}[\mathrm{u}_m])^*\big], \tag{2.102}$$

and

$$r(n, m) = \mathbb{E}[\mathrm{u}_n \mathrm{u}_m^*]. \tag{2.103}$$

2.4.2 STATIONARITY AND ERGODICITY

Definition 2.1 (Strict-sense stationarity). A stochastic process $\mathrm{u}_n$ is said to be strict-sense stationary (SSS) if its statistical properties are invariant to a shift of the origin, or if $\forall k \in \mathbb{Z}$

$$p(u_n, u_m, \ldots, u_r) = p(u_{n-k}, u_{m-k}, \ldots, u_{r-k}), \tag{2.104}$$

and for any possible combination of time instants, $n, m, \ldots, r \in \mathbb{Z}$.

In other words, the stochastic processes $\mathrm{u}_n$ and $\mathrm{u}_{n-k}$ are described by the same joint PDFs of all orders. A weaker version of stationarity is that of the $m$th-order stationarity, where joint PDFs involving up to $m$ variables are invariant to the choice of the origin. For example, for a second-order ($m = 2$) stationary process, we have $p(u_n) = p(u_{n-k})$ and $p(u_n, u_r) = p(u_{n-k}, u_{r-k})$, $\forall n, r, k \in \mathbb{Z}$.

Definition 2.2 (Wide-sense stationarity). A stochastic process $\mathrm{u}_n$ is said to be wide-sense stationary (WSS) if the mean value is constant over all time instants and the autocorrelation/autocovariance sequences depend on the difference of the involved time indices, or

$$\mu_n = \mu, \quad \text{and} \quad r(n, n - k) = r(k). \tag{2.105}$$

Note that WSS is a weaker version of the second-order stationarity; in the latter case, all possible second-order statistics are independent of the time origin. In the former, we only require the autocorrelation (autocovariance) and the mean value to be independent of the time origin. The reason we focus on these two quantities (statistics) is that they are of major importance in the study of linear systems and in the mean-square estimation, as we will see in Chapter 4.

Obviously, an SSS process is also WSS but, in general, not the other way around. For WSS processes, the autocorrelation becomes a sequence with a single time index as the free parameter; thus its value, which measures a relation of the random variables at two time instants, depends solely on how much these time instants differ, and not on their specific values.

From our basic statistics course, we know that given a random variable $\mathrm{x}$, its mean value can be approximated by the sample mean. Carrying out $N$ successive independent experiments, let $x_n$, $n = 1, 2, \ldots, N$, be the obtained values, known as observations. The sample mean is defined as

$$\hat{\mu}_N := \frac{1}{N} \sum_{n=1}^{N} x_n. \tag{2.106}$$

For large enough values of $N$, we expect the sample mean to be close to the true mean value, $\mathbb{E}[\mathrm{x}]$. In a more formal way, this is guaranteed by the fact that $\hat{\mu}_N$ is associated with an unbiased and consistent estimator. We will discuss such issues in Chapter 3; however, we can refresh our memory at this point. Every time we repeat the $N$ random experiments, different samples result and hence a different estimate $\hat{\mu}_N$ is computed. Thus, the values of the estimates define a new random variable, $\hat{\upmu}_N$, known as the estimator. This is unbiased, because it can easily be shown that

$$\mathbb{E}[\hat{\upmu}_N] = \mathbb{E}[\mathrm{x}], \tag{2.107}$$

and it is consistent because its variance tends to zero as $N \to +\infty$ (Problem 2.8). These two properties guarantee that, with high probability, for large values of $N$, $\hat{\mu}_N$ will be close to the true mean value.
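The two properties of the sample mean can be observed empirically by repeating the set of N experiments many times. The sketch below assumes observations that are uniform on (0, 1), so the true mean is 0.5; the number of repetitions and the seed are arbitrary.

```python
import random

def sample_mean(rng, n_obs):
    """Sample mean of n_obs independent uniform(0, 1) observations, Eq. (2.106)."""
    return sum(rng.random() for _ in range(n_obs)) / n_obs

def estimator_stats(n_obs, n_repeats=2000, seed=3):
    """Mean and variance of the estimates over repeated sets of n_obs experiments."""
    rng = random.Random(seed)
    estimates = [sample_mean(rng, n_obs) for _ in range(n_repeats)]
    mean = sum(estimates) / n_repeats
    var = sum((e - mean) ** 2 for e in estimates) / n_repeats
    return mean, var

# Same estimator, two different numbers of observations per experiment set.
mean_small, var_small = estimator_stats(n_obs=10)
mean_large, var_large = estimator_stats(n_obs=100)
```

In both cases the estimates average out near the true mean (unbiasedness), while the spread of the estimates shrinks as the number of observations grows (consistency).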
To apply the concept of sample mean approximation to random processes, one must have at her/his
disposal a number of N realizations, and compute the sample mean at different time instants “across the 
process,” using different realizations, representing the ensemble of sequences. Similarly, sample mean 
arguments can be used to approximate the autocovariance/autocorrelation sequences. However, this 
is a costly operation, since now each experiment results in an infinite number of values (a sequence 
of values). Moreover, it is common in practical applications that only one realization is available to 
the user. 

To this end, we will now define a special type of stochastic processes, where the sample mean 
operation can be significantly simplified. 

Definition 2.3 (Ergodicity). A stochastic process is said to be ergodic if the complete statistics can be determined by any one of the realizations.

In other words, if a process is ergodic, every single realization carries identical statistical information and it can describe the entire random process. Since from a single sequence only one set of PDFs can be obtained, we conclude that every ergodic process is necessarily stationary. A nonstationary process has infinite sets of PDFs, depending upon the choice of the origin. For example, there is only one mean value that can result from a single realization and be obtained as a (time) average over the values of the sequence. Hence, the mean value of a stochastic process that is ergodic must be constant for all time instants, or independent of the time origin. The same is true for all higher-order statistics.

A special type of ergodicity is that of the second-order ergodicity. This means that only statistics up to a second order can be obtained from a single realization. Second-order ergodic processes are necessarily WSS. For second-order ergodic processes, the following are true:

$$\mathbb{E}[\mathrm{u}_n] = \mu = \lim_{N \to \infty} \hat{\mu}_N, \tag{2.108}$$

where

$$\hat{\mu}_N := \frac{1}{2N + 1} \sum_{n=-N}^{N} u_n.$$

Also,

$$\mathrm{cov}(k) = \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} (u_n - \mu)(u_{n-k} - \mu), \tag{2.109}$$

where both limits are in the mean-square sense; that is,

$$\lim_{N \to \infty} \mathbb{E}\big[|\hat{\mu}_N - \mu|^2\big] = 0,$$

and similarly for the autocovariance. Note that often, ergodicity is only required to be assumed for the computation of the mean and covariance and not for all possible second-order statistics. In this case, we talk about mean-ergodic and covariance-ergodic processes.

In summary, when ergodic processes are involved, ensemble averages “across the process” can be obtained as time averages “along the process”; see Fig. 2.12.

FIGURE 2.12

For ergodic processes, the common mean value, for all time instants (ensemble averaging “across” the process), is computed as the time average “along” the process.

In practice, when only a finite number of samples from a realization is available, then the mean and 
covariance are approximated as the respective sample means. 

An issue is to establish conditions under which a process is mean-ergodic or covariance-ergodic. 
Such conditions do exist, and the interested reader can find such information in more specialized books 
[6]. It turns out that the condition for mean-ergodicity relies on second-order statistics and the condition 
for covariance-ergodicity on fourth-order statistics. 

It is very common in statistics as well as in machine learning and signal processing to subtract 
the mean value from the data during the preprocessing stage. In such a case, we say that the data are 
centered. The resulting new process has now zero mean value, and the covariance and autocorrelation 
sequences coincide. From now on, we will assume that the mean is known (or computed as a sample 
mean) and then subtracted. Such a treatment simplifies the analysis without harming generality. 

Example 2.2. The goal of this example is to construct a process that is WSS yet not ergodic. Let a WSS process, $\mathrm{u}_n$, with

$$\mathbb{E}[\mathrm{u}_n] = \mu,$$

and

$$\mathbb{E}[\mathrm{u}_n \mathrm{u}_{n-k}] = r_u(k).$$

Define the process

$$\mathrm{v}_n := \mathrm{a}\mathrm{u}_n, \tag{2.110}$$

where $\mathrm{a}$ is a random variable taking values in $\{0, 1\}$, with probabilities $P(0) = P(1) = 0.5$. Moreover, $\mathrm{a}$ and $\mathrm{u}_n$ are statistically independent. Then, we have

$$\mathbb{E}[\mathrm{v}_n] = \mathbb{E}[\mathrm{a}\mathrm{u}_n] = \mathbb{E}[\mathrm{a}]\,\mathbb{E}[\mathrm{u}_n] = 0.5\mu, \tag{2.111}$$

and

$$\mathbb{E}[\mathrm{v}_n \mathrm{v}_{n-k}] = \mathbb{E}[\mathrm{a}^2]\,\mathbb{E}[\mathrm{u}_n \mathrm{u}_{n-k}] = 0.5\, r_u(k). \tag{2.112}$$

Thus, $\mathrm{v}_n$ is WSS. However, it is not ergodic. Indeed, some of the realizations will be equal to zero (when $a = 0$), and the mean value and autocorrelation, which will result from them as time averages, will be zero, which is different from the ensemble averages.
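The gap between ensemble and time averages in this example shows up directly in a simulation. The sketch below makes assumptions beyond the text: the underlying WSS process is taken to be a constant mean of 2 plus unit-variance white Gaussian noise, and the realization length, number of realizations, and seed are arbitrary.

```python
import random

MU = 2.0   # assumed mean of the underlying WSS process u_n

def realization(rng, length):
    """One realization of v_n = a * u_n, with u_n = MU + unit-variance white Gaussian noise."""
    a = rng.choice((0, 1))                       # drawn once per realization
    return [a * (MU + rng.gauss(0.0, 1.0)) for _ in range(length)]

rng = random.Random(4)
realizations = [realization(rng, 400) for _ in range(1000)]

# Ensemble average "across" the process at a fixed time instant n0.
n0 = 123
ensemble_mean = sum(v[n0] for v in realizations) / len(realizations)

# Time averages "along" each realization.
time_means = [sum(v) / len(v) for v in realizations]
```

The ensemble average settles near 0.5μ, in line with Eq. (2.111), whereas each individual time average is either zero or close to μ; no single realization reproduces the ensemble mean.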
2.4.3 POWER SPECTRAL DENSITY

The Fourier transform is an indispensable tool for representing in a compact way, in the frequency 
domain, the variations that a function/sequence undergoes in terms of its free variable (e.g., time). 
Stochastic processes are inherently related to time. The question that is now raised is whether stochastic 
processes can be described in terms of a Fourier transform. The answer is affirmative, and the vehicle to 
achieve this is via the autocorrelation sequence for processes that are at least WSS. Prior to providing 
the necessary definitions, it is useful to summarize some common properties of the autocorrelation 
sequence. 

Properties of the Autocorrelation Sequence 

Let $\mathrm{u}_n$ be a WSS process. Its autocorrelation sequence has the following properties, which are given for the more general complex-valued case:

• Property I.

$$r(k) = r^*(-k), \quad \forall k \in \mathbb{Z}. \tag{2.113}$$

This property is a direct consequence of the invariance with respect to the choice of the origin. Indeed,

$$r(k) = \mathbb{E}[\mathrm{u}_n \mathrm{u}_{n-k}^*] = \mathbb{E}[\mathrm{u}_{n+k} \mathrm{u}_n^*] = r^*(-k).$$

• Property II.

$$r(0) = \mathbb{E}\big[|\mathrm{u}_n|^2\big]. \tag{2.114}$$

That is, the value of the autocorrelation at $k = 0$ is equal to the mean-square of the magnitude of the respective random variables. Interpreting the square of the magnitude of a variable as its energy, $r(0)$ can be interpreted as the corresponding (average) power.

• Property III.

$$r(0) \geq |r(k)|, \quad \forall k \neq 0. \tag{2.115}$$

The proof is provided in Problem 2.9. In other words, the correlation of the variables, corresponding to two different time instants, cannot be larger (in magnitude) than $r(0)$. As we will see in Chapter 4, this property is essentially the Cauchy-Schwarz inequality for the inner products (see also Appendix of Chapter 8).

• Property IV. The autocorrelation sequence of a stochastic process is positive definite. That is,

$$\sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m^*\, r(n, m) \geq 0, \quad \forall a_n \in \mathbb{C},\ n = 1, 2, \ldots, N,\ \forall N \in \mathbb{Z}. \tag{2.116}$$

Proof. The proof is easily obtained by the definition of the autocorrelation,

$$0 \leq \mathbb{E}\left[\Big|\sum_{n=1}^{N} a_n \mathrm{u}_n\Big|^2\right] = \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m^*\, \mathbb{E}[\mathrm{u}_n \mathrm{u}_m^*], \tag{2.117}$$

which proves the claim. Note that strictly speaking, we should say that it is semipositive definite. However, the “positive definite” name is the one that has survived in the literature. This property will be useful when introducing Gaussian processes in Chapter 13. □

• Property V. Let $\mathrm{u}_n$ and $\mathrm{v}_n$ be two WSS processes. Define the new process

$$\mathrm{z}_n = \mathrm{u}_n + \mathrm{v}_n.$$

Then,

$$r_z(k) = r_u(k) + r_v(k) + r_{uv}(k) + r_{vu}(k), \tag{2.118}$$

where the cross-correlation between two jointly WSS stochastic processes is defined as

$$r_{uv}(k) := \mathbb{E}[\mathrm{u}_n \mathrm{v}_{n-k}^*], \quad k \in \mathbb{Z}: \text{ cross-correlation.} \tag{2.119}$$

The proof is a direct consequence of the definition. Note that if the two processes are uncorrelated, as when $r_{uv}(k) = r_{vu}(k) = 0$, then

$$r_z(k) = r_u(k) + r_v(k).$$

Obviously, this is also true if the processes $\mathrm{u}_n$ and $\mathrm{v}_n$ are independent and of zero mean value, since then $\mathbb{E}[\mathrm{u}_n \mathrm{v}_{n-k}^*] = \mathbb{E}[\mathrm{u}_n]\,\mathbb{E}[\mathrm{v}_{n-k}^*] = 0$. It should be stressed here that uncorrelatedness is a weaker condition and it does not necessarily imply independence; the opposite is true for zero mean values.

• Property VI.

$$r_{uv}(k) = r_{vu}^*(-k). \tag{2.120}$$

The proof is similar to that of Property I.

• Property VII.

$$r_u(0)\, r_v(0) \geq |r_{uv}(k)|^2, \quad \forall k \in \mathbb{Z}. \tag{2.121}$$

The proof is also given in Problem 2.9.
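Properties I, III, and IV can be spot-checked for a concrete sequence. The sketch below assumes the real-valued example $r(k) = \rho^{|k|}$ with $\rho = 0.8$, which is our choice and not taken from the text, and evaluates the quadratic form of Eq. (2.116) for randomly drawn complex coefficients, using $r(n, m) = r(n - m)$ for a WSS process.

```python
import random

def autocorr(k, rho=0.8, power=1.0):
    """Assumed example autocorrelation r(k) = power * rho**|k| (real valued)."""
    return power * rho ** abs(k)

def quadratic_form(coeffs, r):
    """sum_n sum_m a_n a_m^* r(n - m), the left-hand side of Eq. (2.116) for a WSS process."""
    total = 0j
    for n, a_n in enumerate(coeffs):
        for m, a_m in enumerate(coeffs):
            total += a_n * a_m.conjugate() * r(n - m)
    return total

rng = random.Random(5)
forms = []
for _ in range(200):
    # Random complex coefficients a_1, ..., a_6.
    coeffs = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
    forms.append(quadratic_form(coeffs, autocorr))

min_real = min(q.real for q in forms)            # should not be negative (Property IV)
max_imag = max(abs(q.imag) for q in forms)       # should vanish up to rounding
dominated = all(autocorr(0) >= abs(autocorr(k)) for k in range(-20, 21))   # Property III
symmetric = all(autocorr(k) == autocorr(-k) for k in range(21))            # Property I, real case
```

A random search of this kind cannot prove positive definiteness; it only fails to find a counterexample. A sequence that violated Property IV would typically be exposed quickly by such a test.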

Power Spectral Density

Definition 2.4. Given a WSS stochastic process $\mathrm{u}_n$, its power spectral density (PSD) (or simply the power spectrum) is defined as the Fourier transform of its autocorrelation sequence,

$$S(\omega) := \sum_{k=-\infty}^{\infty} r(k) \exp(-j\omega k): \text{ power spectral density.} \tag{2.122}$$

Using the Fourier transform properties, we can recover the autocorrelation sequence via the inverse Fourier transform in the following manner:

$$r(k) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} S(\omega) \exp(j\omega k)\,d\omega. \tag{2.123}$$

Due to the properties of the autocorrelation sequence, the PSD has some interesting and useful properties, from a practical point of view.

Properties of the PSD

• The PSD of a WSS stochastic process is a real and nonnegative function of $\omega$. Indeed, we have

$$\begin{aligned} S(\omega) &= \sum_{k=-\infty}^{+\infty} r(k) \exp(-j\omega k) \\ &= r(0) + \sum_{k=-\infty}^{-1} r(k) \exp(-j\omega k) + \sum_{k=1}^{\infty} r(k) \exp(-j\omega k) \\ &= r(0) + \sum_{k=1}^{+\infty} r^*(k) \exp(j\omega k) + \sum_{k=1}^{\infty} r(k) \exp(-j\omega k) \\ &= r(0) + 2 \sum_{k=1}^{+\infty} \mathrm{Real}\{r(k) \exp(-j\omega k)\}, \end{aligned} \tag{2.124}$$

which proves the claim that PSD is a real number. In the proof, Property I of the autocorrelation sequence has been used. We defer the proof concerning the nonnegativity to the end of this section.

• The area under the graph of $S(\omega)$ is proportional to the power of the stochastic process, as expressed by

$$\mathbb{E}\big[|\mathrm{u}_n|^2\big] = r(0) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} S(\omega)\,d\omega, \tag{2.125}$$

which is obtained from Eq. (2.123) if we set $k = 0$. We will come to the physical meaning of this property very soon.
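Both properties can be verified numerically for a specific autocorrelation sequence. The sketch below again assumes $r(k) = \rho^{|k|}$ with $\rho = 0.8$ (our choice), truncates the infinite sum of Eq. (2.122), and compares it with the closed form that is known for this particular sequence.

```python
import cmath
import math

RHO = 0.8   # assumed example: r(k) = RHO**|k|, a valid real autocorrelation sequence

def autocorr(k):
    """Example autocorrelation sequence."""
    return RHO ** abs(k)

def psd(omega, k_max=200):
    """Truncated version of the sum in Eq. (2.122); terms beyond k_max are negligible here."""
    return sum(autocorr(k) * cmath.exp(-1j * omega * k) for k in range(-k_max, k_max + 1))

def psd_closed_form(omega):
    """Known closed form of the PSD for r(k) = RHO**|k|."""
    return (1.0 - RHO ** 2) / (1.0 - 2.0 * RHO * math.cos(omega) + RHO ** 2)

n_grid = 512
omegas = [-math.pi + 2.0 * math.pi * (i + 0.5) / n_grid for i in range(n_grid)]
values = [psd(w) for w in omegas]

max_imag = max(abs(v.imag) for v in values)                 # PSD is real
min_real = min(v.real for v in values)                      # PSD is nonnegative
max_err = max(abs(v.real - psd_closed_form(w)) for v, w in zip(values, omegas))
# Eq. (2.125): (1 / 2 pi) * integral of S over [-pi, pi] should return r(0).
power = sum(v.real for v in values) * (2.0 * math.pi / n_grid) / (2.0 * math.pi)
```

The imaginary parts vanish up to rounding, the real parts stay positive over the whole grid, and the normalized area reproduces $r(0)$, as Eq. (2.125) states.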


Transmission Through a Linear System 

One of the most important tasks in signal processing and systems theory is the linear filtering operation on an input time series (signal) to generate another output sequence. The block diagram of the filtering operation is shown in Fig. 2.13. From the linear system theory and signal processing basics, it is established that for a class of linear systems known as linear time-invariant, the input-output relation is given via the elegant convolution between the input sequence and the impulse response of the filter,

$$\mathrm{d}_n = w_n * \mathrm{u}_n := \sum_{i=-\infty}^{+\infty} w_i^* \mathrm{u}_{n-i}: \text{ convolution sum,} \tag{2.126}$$

where $\ldots, w_0, w_1, w_2, \ldots$ are the parameters comprising the impulse response describing the filter [8]. In case the impulse response is of finite duration, for example, $w_0, w_1, \ldots, w_{l-1}$, and the rest of the values are zero, then the convolution can be written as

$$\mathrm{d}_n = \sum_{i=0}^{l-1} w_i^* \mathrm{u}_{n-i} = \boldsymbol{w}^H \boldsymbol{\mathrm{u}}_n, \tag{2.127}$$

where

$$\boldsymbol{w} := [w_0, w_1, \ldots, w_{l-1}]^T, \tag{2.128}$$

and

$$\boldsymbol{\mathrm{u}}_n := [\mathrm{u}_n, \mathrm{u}_{n-1}, \ldots, \mathrm{u}_{n-l+1}]^T \in \mathbb{R}^l. \tag{2.129}$$

The latter is known as the input vector of order $l$ and at time $n$. It is interesting to note that this is a random vector. However, its elements are part of the stochastic process at successive time instants. This gives the respective autocorrelation matrix certain properties and a rich structure, which will be studied and exploited in Chapter 4. As a matter of fact, this is the reason that we used different symbols to denote processes and general random vectors; thus, the reader can readily remember that when dealing with a process, the elements of the involved random vectors have this extra structure.

FIGURE 2.13

The linear system (filter) is excited by the input sequence (signal) $\mathrm{u}_n$ and provides the output sequence (signal) $\mathrm{d}_n$.

Footnote 2: Recall that if $z = a + jb$ is a complex number, its real part is $\mathrm{Real}\{z\} = a = \frac{1}{2}(z + z^*)$.

Moreover, observe from Eq. (2.126) that if the impulse response of the system is zero for negative values of the time index $n$, this guarantees causality. That is, the output depends only on the values of the input at the current and previous time instants, and there is no dependence on future values. As a matter of fact, this is also a necessary condition for causality; that is, if the system is causal, then its impulse response is zero for negative time instants [8].
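The finite-duration convolution and its inner-product form can be written out in a few lines. The sketch below is restricted to the real-valued case, so the conjugation plays no role, and it assumes that input samples before time zero are zero; the impulse response and input values are made up for illustration.

```python
def fir_filter(w, u):
    """Output d_n = sum_{i=0}^{l-1} w_i u_{n-i} of a causal FIR filter (real-valued case).

    Samples of u before time 0 are taken to be zero.
    """
    l = len(w)
    d = []
    for n in range(len(u)):
        # Input vector u_n = [u_n, u_{n-1}, ..., u_{n-l+1}]^T of Eq. (2.129).
        u_vec = [u[n - i] if n - i >= 0 else 0.0 for i in range(l)]
        d.append(sum(w_i * u_i for w_i, u_i in zip(w, u_vec)))   # inner product w^T u_n
    return d

w = [0.5, 0.3, 0.2]                 # hypothetical impulse response of length l = 3
u = [1.0, 2.0, 3.0, 4.0]            # hypothetical input samples
d = fir_filter(w, u)
impulse_out = fir_filter(w, [1.0, 0.0, 0.0, 0.0])   # returns the impulse response itself
```

Because only current and past input samples enter each inner product, the filter is causal by construction. Feeding a unit impulse returns the impulse response, followed by zeros.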

Theorem 2.1. The PSD of the output $\mathrm{d}_n$ of a linear time-invariant system, when it is excited by a WSS stochastic process $\mathrm{u}_n$, is given by

$$S_d(\omega) = |W(\omega)|^2 S_u(\omega), \tag{2.130}$$

where

$$W(\omega) := \sum_{n=-\infty}^{+\infty} w_n \exp(-j\omega n). \tag{2.131}$$

Proof. First, it is shown (Problem 2.10) that

$$r_d(k) = r_u(k) * w_k * w_{-k}^*. \tag{2.132}$$

Then, taking the Fourier transform of both sides, we obtain Eq. (2.130). To this end, we used the well-known properties of the Fourier transform,

$$r_u(k) * w_k \longmapsto S_u(\omega) W(\omega), \quad \text{and} \quad w_{-k}^* \longmapsto W^*(\omega). \quad □$$

FIGURE 2.14

An ideal bandpass filter. The output contains frequencies only in the range of $|\omega - \omega_o| < \Delta\omega/2$.
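The theorem can be checked without any simulation in a simple special case. The sketch below assumes a white input, so that $r_u(k) = \sigma^2\delta(k)$ and $S_u(\omega) = \sigma^2$, and a real, causal, finite impulse response; the numerical values are hypothetical. It computes $r_d(k)$ from Eq. (2.132), transforms it as in Eq. (2.122), and compares the result with $|W(\omega)|^2 S_u(\omega)$.

```python
import cmath
import math

SIGMA2 = 1.5                 # assumed variance of the white noise input
W = [1.0, -0.6, 0.25]        # hypothetical real, causal FIR impulse response

def output_autocorr(k):
    """r_d(k) from Eq. (2.132) when r_u(k) = SIGMA2 * delta(k) and w is real.

    The double convolution then collapses to SIGMA2 * sum_i w_i w_{i-k}.
    """
    total = 0.0
    for i, w_i in enumerate(W):
        j = i - k
        if 0 <= j < len(W):
            total += w_i * W[j]
    return SIGMA2 * total

def freq_response(omega):
    """W(omega) of Eq. (2.131)."""
    return sum(w_n * cmath.exp(-1j * omega * n) for n, w_n in enumerate(W))

def psd_from_autocorr(omega):
    """Fourier transform of r_d(k), Eq. (2.122); r_d(k) vanishes for |k| >= len(W)."""
    span = len(W) - 1
    return sum(output_autocorr(k) * cmath.exp(-1j * omega * k) for k in range(-span, span + 1))

omegas = [-math.pi + 2.0 * math.pi * i / 64 for i in range(65)]
# Largest discrepancy between the two sides of Eq. (2.130) over the frequency grid.
max_gap = max(abs(psd_from_autocorr(w) - SIGMA2 * abs(freq_response(w)) ** 2) for w in omegas)
```

The two sides agree to rounding error across the grid. For a colored input, the same check would need the full convolution with $r_u(k)$ rather than the collapsed sum used here.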


Physical Interpretation of the PSD 

We are now ready to justify why the Fourier transform of the autocorrelation sequence was given the specific name of “power spectral density.” We restrict our discussion to real processes, although similar arguments hold true for the more general complex case. Fig. 2.14 shows the magnitude of the Fourier transform of the impulse response of a very special linear system. The Fourier transform is unity for any frequency in the range $|\omega - \omega_o| \leq \frac{\Delta\omega}{2}$ and zero otherwise. Such a system is known as bandpass filter. We assume that $\Delta\omega$ is very small. Then, using Eq. (2.130) and assuming that within the intervals $|\omega - \omega_o| \leq \frac{\Delta\omega}{2}$, $S_u(\omega) \approx S_u(\omega_o)$, we have

$$S_d(\omega) = \begin{cases} S_u(\omega_o), & \text{if } |\omega - \omega_o| \leq \frac{\Delta\omega}{2}, \\ 0, & \text{otherwise.} \end{cases} \tag{2.133}$$

Hence,

$$\Delta P := \mathbb{E}\big[|\mathrm{d}_n|^2\big] = r_d(0) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} S_d(\omega)\,d\omega \approx S_u(\omega_o) \frac{\Delta\omega}{\pi}, \tag{2.134}$$

due to the symmetry of the PSD ($S_u(\omega) = S_u(-\omega)$). Hence,

$$\frac{1}{\pi} S_u(\omega_o) = \frac{\Delta P}{\Delta\omega}. \tag{2.135}$$

In other words, the value $S_u(\omega_o)$ can be interpreted as the power density (power per frequency interval) in the frequency (spectrum) domain.

Moreover, this also establishes what was said before that the PSD is a nonnegative real function for any value of $\omega \in [-\pi, +\pi]$ (the PSD, being the Fourier transform of a sequence, is periodic with period $2\pi$, e.g., [8]).

Remarks 2.2. 

• Note that for any WSS stochastic process, there is only one autocorrelation sequence that describes it. However, the converse is not true. A single autocorrelation sequence can correspond to more than one WSS process. Recall that the autocorrelation is the mean value of the product of random variables. However, many random variables can have the same mean value.

• We have shown that the Fourier transform, $S(\omega)$, of an autocorrelation sequence, $r(k)$, is nonnegative. Moreover, if a sequence $r(k)$ has a nonnegative Fourier transform, then it is positive definite and we can always construct a WSS process that has $r(k)$ as its autocorrelation sequence (e.g., [6, pages 410, 421]). Thus, the necessary and sufficient condition for a sequence to be an autocorrelation sequence is the nonnegativity of its Fourier transform.

Example 2.3 (White noise sequence). A stochastic process $\upeta_n$ is said to be white noise if the mean and its autocorrelation sequence satisfy

$$\mathbb{E}[\upeta_n] = 0 \quad \text{and} \quad r(k) = \begin{cases} \sigma_\eta^2, & \text{if } k = 0, \\ 0, & \text{if } k \neq 0, \end{cases} : \text{ white noise,} \tag{2.136}$$

where $\sigma_\eta^2$ is its variance. In other words, all variables at different time instants are uncorrelated. If, in addition, they are independent, we say that it is strictly white noise. It is readily seen that its PSD is given by

$$S_\eta(\omega) = \sigma_\eta^2. \tag{2.137}$$

That is, it is constant, and this is the reason it is called white noise, analogous to white light, whose spectrum is equally spread over all wavelengths.

2.4.4 AUTOREGRESSIVE MODELS 

We have just seen an example of a stochastic process, namely, white noise. We now turn our attention to generating WSS processes via appropriate modeling. In this way, we will introduce controlled correlation among the variables, corresponding to the various time instants. We focus on the real data case, to simplify the discussion.

Autoregressive processes are among the most popular and widely used models. An autoregressive process of order $l$, denoted as AR($l$), is defined via the following difference equation:

$$\mathrm{u}_n + a_1 \mathrm{u}_{n-1} + \cdots + a_l \mathrm{u}_{n-l} = \upeta_n: \text{ autoregressive process,} \tag{2.138}$$

where $\upeta_n$ is a white noise process with variance $\sigma_\eta^2$.

As is always the case with any difference equation, one starts from some initial conditions and
then generates samples recursively by plugging into the model the input sequence samples. The input
samples here correspond to a white noise sequence and the initial conditions are set equal to zero,

$$\mathrm{u}_{-1} = \cdots = \mathrm{u}_{-l} = 0.$$

There is no need to mobilize mathematics to see that such a process is not stationary. Indeed, time
instant $n = 0$ is distinctly different from all the rest, since it is the time at which initial conditions are
applied. However, the effects of the initial conditions tend asymptotically to zero if all the roots of the
corresponding characteristic polynomial,

$$z^l + a_1 z^{l-1} + \cdots + a_l = 0,$$

have magnitude less than unity (the solution of the corresponding homogeneous equation, without
input, tends to zero) [7]. Then, it can be shown that asymptotically, the AR($l$) becomes WSS. This is
the assumption that is usually adopted in practice, which will be the case for the rest of this section.
Note that the mean value of the process is zero (try it).
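
The recursive generation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input is taken to be Gaussian white noise, the function name `simulate_ar` and the `burn_in` parameter (the number of initial samples discarded so that the effect of the zero initial conditions has approximately died out) are hypothetical choices, not part of the text.

```python
import random


def simulate_ar(a, n, sigma=1.0, burn_in=1000, seed=0):
    """Generate n samples of u_n = -a[0] u_{n-1} - ... - a[l-1] u_{n-l} + eta_n.

    The recursion starts from zero initial conditions; the first burn_in
    samples are discarded so that their effect has (approximately) died out.
    """
    rng = random.Random(seed)
    l = len(a)
    past = [0.0] * l                      # past[i] holds u_{n-1-i}
    samples = []
    for t in range(burn_in + n):
        eta = rng.gauss(0.0, sigma)       # white noise input
        u = eta - sum(ai * ui for ai, ui in zip(a, past))
        past = ([u] + past)[:l]           # shift the memory by one step
        if t >= burn_in:
            samples.append(u)
    return samples


u = simulate_ar([-0.9], n=50_000)
print(sum(u) / len(u))                    # sample mean, expected to be near zero
```

The coefficients passed in `a` should correspond to a characteristic polynomial with all roots inside the unit circle; otherwise the recursion diverges instead of settling into a stationary regime.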

The goal now becomes to compute the corresponding autocorrelation sequence, $r(k)$, $k \in \mathbb{Z}$. Multiplying
both sides in Eq. (2.138) with $\mathrm{u}_{n-k}$, $k > 0$, and taking the expectation, we obtain

$$\sum_{i=0}^{l} a_i \, \mathbb{E}[\mathrm{u}_{n-i} \mathrm{u}_{n-k}] = \mathbb{E}[\eta_n \mathrm{u}_{n-k}], \quad k > 0,$$

where $a_0 := 1$, or

$$\sum_{i=0}^{l} a_i \, r(k - i) = 0. \tag{2.139}$$


We have used the fact that $\mathbb{E}[\eta_n \mathrm{u}_{n-k}]$, $k > 0$, is zero. Indeed, $\mathrm{u}_{n-k}$ depends recursively on $\eta_{n-k}, \eta_{n-k-1}, \ldots$,
which are all uncorrelated to $\eta_n$, since this is a white noise process. Note that Eq. (2.139)
is a difference equation, which can be solved provided we have the initial conditions. To this end,
multiply Eq. (2.138) by $\mathrm{u}_n$ and take expectations, which results in

$$\sum_{i=0}^{l} a_i \, r(i) = \sigma_\eta^2, \tag{2.140}$$

since $\mathrm{u}_n$ recursively depends on $\eta_n$, which contributes the $\sigma_\eta^2$ term, and $\eta_{n-1}, \ldots$, which result in zeros.
Combining Eqs. (2.140)-(2.139) the following linear system of equations results:


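$$\begin{bmatrix} r(0) & r(1) & \cdots & r(l) \\ r(1) & r(0) & \cdots & r(l-1) \\ \vdots & \vdots & \ddots & \vdots \\ r(l) & r(l-1) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_l \end{bmatrix} = \begin{bmatrix} \sigma_\eta^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{2.141}$$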


These are known as the Yule-Walker equations, whose solution results in the values $r(0), \ldots, r(l)$,
which are then used as the initial conditions to solve the difference equation in (2.139) and obtain
$r(k)$, $\forall k \in \mathbb{Z}$.

Observe the special structure of the matrix in the linear system. This type of matrix is known as
Toeplitz, and this is the property that will be exploited to efficiently solve such systems, which result
when the autocorrelation matrix of a WSS process is involved; see Chapter 4.
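
The two-stage procedure (solve the Yule-Walker system for $r(0), \ldots, r(l)$, then run the difference equation (2.139) for larger lags) can be sketched as follows. The sketch rearranges (2.141) so that the autocorrelation values are the unknowns, using $r(-k) = r(k)$ for the real data case. It relies on a plain Gaussian elimination rather than on the Toeplitz-exploiting solvers mentioned above, and the function names are illustrative assumptions.

```python
def solve_linear(m, b):
    """Solve m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(m, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def ar_autocorrelation(a, sigma2, max_lag):
    """Return [r(0), ..., r(max_lag)] for u_n + a[0] u_{n-1} + ... + a[l-1] u_{n-l} = eta_n."""
    coeffs = [1.0] + list(a)              # a_0 := 1
    l = len(a)
    # Row k encodes sum_i a_i r(|k - i|) = sigma2 if k == 0 else 0
    m = [[0.0] * (l + 1) for _ in range(l + 1)]
    for k in range(l + 1):
        for i, ai in enumerate(coeffs):
            m[k][abs(k - i)] += ai
    r = solve_linear(m, [sigma2] + [0.0] * l)
    # Extend to larger lags with the difference equation (2.139)
    for k in range(l + 1, max_lag + 1):
        r.append(-sum(a[i - 1] * r[k - i] for i in range(1, l + 1)))
    return r[:max_lag + 1]


print(ar_autocorrelation([-0.9], sigma2=1.0, max_lag=4))
```

Negative lags follow from the symmetry $r(-k) = r(k)$, so only nonnegative lags are returned.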

Besides the autoregressive models, other types of stochastic models have been suggested and used.
The autoregressive-moving average (ARMA) model of order $(l, m)$ is defined by the difference equation

$$\mathrm{u}_n + a_1 \mathrm{u}_{n-1} + \cdots + a_l \mathrm{u}_{n-l} = b_1 \eta_n + \cdots + b_m \eta_{n-m}, \tag{2.142}$$

and the moving average model of order $m$, denoted as MA($m$), is defined as

$$\mathrm{u}_n = b_1 \eta_n + \cdots + b_m \eta_{n-m}. \tag{2.143}$$



FIGURE 2.15 


(A) The time evolution of a realization of the AR(1) with $a = -0.9$ and (B) the respective autocorrelation sequence.
(C) The time evolution of a realization of the AR(1) with $a = -0.4$ and (D) the corresponding autocorrelation
sequence.

Note that the AR($l$) and the MA($m$) models can be considered as special cases of the ARMA($l$, $m$). For
a more theoretical treatment of the topic, see [1].

Example 2.4. Consider the AR(1) process,

$$\mathrm{u}_n + a \mathrm{u}_{n-1} = \eta_n.$$

Following the general methodology explained before, we have

$$r(k) + a\,r(k-1) = 0, \quad k = 1, 2, \ldots,$$
$$r(0) + a\,r(1) = \sigma_\eta^2.$$

Taking the first equation for $k = 1$ together with the second one readily results in

$$r(0) = \frac{\sigma_\eta^2}{1 - a^2}.$$

Plugging this value into the difference equation, we recursively obtain

$$r(k) = (-a)^{|k|} \frac{\sigma_\eta^2}{1 - a^2}, \quad k = 0, \pm 1, \pm 2, \ldots, \tag{2.144}$$

where we used the property $r(k) = r(-k)$. Observe that if $|a| > 1$, $r(0) < 0$, which is meaningless.
Also, $|a| < 1$ guarantees that the magnitude of the root of the characteristic polynomial ($z_* = -a$)
is smaller than one. Moreover, $|a| < 1$ guarantees that $r(k) \to 0$ as $k \to \infty$. This is in line with
common sense, since variables that are far away must be uncorrelated.
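
As a sanity check of (2.144), one can compare it against time-averaged estimates from a simulated realization. The sketch below assumes Gaussian white noise input and discards an initial stretch of samples so that the process is approximately stationary; the helper names and the particular values of `n`, `burn_in`, and `seed` are illustrative.

```python
import random


def ar1_autocorr_theory(a, sigma2, k):
    """Closed form of Eq. (2.144)."""
    return (-a) ** abs(k) * sigma2 / (1.0 - a * a)


def ar1_autocorr_sample(a, n, max_lag, sigma=1.0, burn_in=500, seed=1):
    """Simulate u_n = -a u_{n-1} + eta_n and return time-averaged estimates of r(k)."""
    rng = random.Random(seed)
    u_prev, u = 0.0, []
    for t in range(n + burn_in):
        u_prev = rng.gauss(0.0, sigma) - a * u_prev
        if t >= burn_in:
            u.append(u_prev)
    return [sum(u[i] * u[i - k] for i in range(k, n)) / n for k in range(max_lag + 1)]


a = -0.4
estimates = ar1_autocorr_sample(a, n=100_000, max_lag=3)
for k, r_hat in enumerate(estimates):
    print(k, round(r_hat, 3), round(ar1_autocorr_theory(a, 1.0, k), 3))
```

The two columns should agree up to the estimation error of the time averages, which decreases as more samples are used.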

Fig. 2.15 shows the time evolution of two AR(1) processes (after the processes have converged 
to be stationary) together with the respective autocorrelation sequences, for two cases, corresponding 
to $a = -0.9$ and $a = -0.4$. Observe that the larger the magnitude of $a$, the smoother the realization
becomes and time variations are slower. This is natural, since nearby samples are highly correlated 
and so, on average, they tend to have similar values. The opposite is true for small values of a. For 
comparison purposes, Fig. 2.16A is the case of a = 0, which corresponds to a white noise. Fig. 2.16B 
shows the PSDs corresponding to the two cases of Fig. 2.15. Observe that the faster the autocorrelation 
approaches zero, the more spread out the PSD is, and vice versa. 
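
The relation between the decay of the autocorrelation and the spread of the PSD can be illustrated by computing the Fourier transform of the AR(1) autocorrelation sequence directly. The sketch below sums a truncated version of $\sum_k r(k) e^{-j\omega k}$ and compares it with the closed form $\sigma_\eta^2 / (1 + 2a\cos\omega + a^2)$; this closed form is not stated in the text and is used here as an assumption, obtained by summing the geometric series. Function names and the truncation length are illustrative.

```python
import math


def ar1_psd_from_autocorr(a, omega, sigma2=1.0, max_lag=2000):
    """Truncated Fourier sum r(0) + 2 sum_{k>=1} r(k) cos(omega k) for the AR(1)."""
    r0 = sigma2 / (1.0 - a * a)
    return r0 + 2.0 * sum((-a) ** k * r0 * math.cos(omega * k) for k in range(1, max_lag + 1))


def ar1_psd_closed_form(a, omega, sigma2=1.0):
    """Assumed closed form sigma2 / |1 + a exp(-j omega)|^2."""
    return sigma2 / (1.0 + 2.0 * a * math.cos(omega) + a * a)


for a in (-0.9, -0.4):
    low = ar1_psd_from_autocorr(a, 0.0)
    high = ar1_psd_from_autocorr(a, math.pi)
    # Ratio of low- to high-frequency power in dB; larger means a less flat spectrum
    print(a, 10.0 * math.log10(low / high))
```

For the larger magnitude of $a$ the power concentrates at low frequencies, while for the smaller one the spectrum is flatter, that is, closer to that of white noise.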


2.5 INFORMATION THEORY 

So far in this chapter, we have looked at some basic definitions and properties concerning probability 
theory and stochastic processes. In the same vein, we will now focus on the basic definitions and notions 
related to information theory. Although information theory was originally developed in the context of
communications and coding disciplines, its application and use have now been adopted in a wide range
of areas, including machine learning. Notions from information theory are used for establishing cost
functions for optimization in parameter estimation problems, and concepts from information theory
are employed to estimate unknown probability distributions in the context of constrained optimization
tasks. We will discuss such methods later in this book.

FIGURE 2.16

(A) The time evolution of a realization from a white noise process. (B) The PSDs in dBs, for the two AR(1) 
sequences of Fig. 2.15. The red one corresponds to $a = -0.4$ and the gray one to $a = -0.9$. The smaller the magnitude
of $a$, the closer the process is to a white noise, and its PSD tends to increase the power with which high
frequencies participate. Since the PSD is the Fourier transform of the autocorrelation sequence, observe that the 
broader a sequence is in time, the narrower its Fourier transform becomes, and vice versa. 


The father of information theory is Claude Elwood Shannon (1916-2001), an American mathematician
and electrical engineer. He founded information theory with the landmark paper “A mathematical
theory of communication,” published in the Bell System Technical Journal in 1948. However, he is also
credited with founding digital circuit design theory in 1937, when, as a 21-year-old Master’s degree
student at the Massachusetts Institute of Technology (MIT), he wrote his thesis demonstrating that
electrical applications of Boolean algebra could construct and resolve any logical, numerical relationship.
So he is also credited as a father of digital computers. Shannon, while working for the national
defense during the Second World War, contributed to the field of cryptography, converting it from an
art to a rigorous scientific field.

As is the case for probability, the notion of information is part of our everyday vocabulary. In this 
context, an event carries information if it is either unknown to us, or if the probability of its occurrence 
is very low and, in spite of that, it happens. For example, if one tells us that the sun shines bright during
summer days in the Sahara desert, we could consider such a statement rather dull and useless. On the 
contrary, if somebody gives us news about snow in the Sahara during summer, that statement carries a 
lot of information and can possibly ignite a discussion concerning climate change. 

Thus, trying to formalize the notion of information from a mathematical point of view, it is reasonable
to define it in terms of the negative logarithm of the probability of an event. If the event is certain
to occur, it carries zero information content; however, if its probability of occurrence is low, then its
information content has a large positive value.

2.5.1 DISCRETE RANDOM VARIABLES

Information 

Given a discrete random variable $\mathrm{x}$, which takes values in the set $\mathcal{X}$, the information associated with
any value $x \in \mathcal{X}$ is denoted as $I(x)$ and it is defined as

$$I(x) = -\log P(x) \;:\; \text{information associated with } \mathrm{x} = x \in \mathcal{X}. \tag{2.145}$$

Any base for the logarithm can be used. If the natural logarithm is chosen, information is measured
in terms of nats (natural units). If the base 2 logarithm is employed, information is measured in terms
of bits (binary digits). Employing the logarithmic function to define information is also in line with
common sense reasoning that the information content of two statistically independent events should
be the sum of the information conveyed by each one of them individually; $I(x, y) = -\log P(x, y) = -\log P(x) - \log P(y)$.

Example 2.5. We are given a binary random variable $\mathrm{x} \in \mathcal{X} = \{0, 1\}$, and we assume that $P(1) = P(0) = 0.5$.
We can consider this random variable as a source that generates and emits two possible
values. The information content of each one of the two equiprobable events is

$$I(0) = I(1) = -\log_2 0.5 = 1 \text{ bit}.$$

Let us now consider another source of random events, which generates code words comprising $k$ binary
variables together. The output of this source can be seen as a random vector with binary-valued
elements, $\mathbf{x} = [\mathrm{x}_1, \ldots, \mathrm{x}_k]^T$. The corresponding probability space, $\mathcal{X}$, comprises $K = 2^k$ elements. If
all possible values have the same probability, $1/K$, then the information content of each possible event
is equal to

$$I(\boldsymbol{x}_i) = -\log_2 \frac{1}{K} = k \text{ bits}.$$

We observe that in the case where the number of possible events is larger, the information content of
each individual one (assuming equiprobable events) becomes larger. This is also in line with common
sense reasoning, since if the source can emit a large number of (equiprobable) events, then the occurrence
of any one of them carries more information than a source that can only emit a few possible
events.
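
The definition (2.145) and the two cases of the example can be reproduced with a few lines of code. This is a minimal sketch; the function name `information` and its `base` argument are illustrative choices.

```python
import math


def information(p, base=2.0):
    """Information content -log P(x) of an event with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p) / math.log(base)


print(information(0.5))             # one of two equiprobable binary symbols, in bits
print(information(1 / 2 ** 8))      # one of K = 2^8 equiprobable code words, in bits
print(information(0.5, math.e))     # the same binary event measured in nats
```

Changing the base only rescales the result, which is why the choice between bits and nats is a matter of convenience.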


Mutual and Conditional Information 

Besides marginal probabilities, we have already been introduced to the concept of conditional probability.
This leads to the definition of mutual information.

Given two discrete random variables, $\mathrm{x} \in \mathcal{X}$ and $\mathrm{y} \in \mathcal{Y}$, the information content provided by the
occurrence of the event $\mathrm{y} = y$ about the event $\mathrm{x} = x$ is measured by the mutual information, denoted as
$I(x; y)$ and defined by

$$I(x; y) := \log \frac{P(x|y)}{P(x)} \;:\; \text{mutual information}. \tag{2.146}$$

Note that if the two variables are statistically independent, then their mutual information is zero; this is
most reasonable, since observing $y$ says nothing about $x$. On the contrary, if by observing $y$ it is certain
that $x$ will occur, as when $P(x|y) = 1$, then the mutual information becomes $I(x; y) = I(x)$, which is
again in line with common reasoning. Mobilizing our now familiar product rule, we can see that

$$I(x; y) = I(y; x).$$

The conditional information of $x$ given $y$ is defined as

$$I(x|y) = -\log P(x|y) \;:\; \text{conditional information}. \tag{2.147}$$

It is straightforward to show that

$$I(x; y) = I(x) - I(x|y). \tag{2.148}$$

Example 2.6. In a communications channel, the source transmits binary symbols, $\mathrm{x}$, with probability
$P(0) = P(1) = 1/2$. The channel is noisy, so the received symbols $\mathrm{y}$ may have changed polarity, due
to noise, with the following probabilities:

$$P(\mathrm{y} = 0 \,|\, \mathrm{x} = 0) = 1 - p,$$
$$P(\mathrm{y} = 1 \,|\, \mathrm{x} = 0) = p,$$
$$P(\mathrm{y} = 1 \,|\, \mathrm{x} = 1) = 1 - q,$$
$$P(\mathrm{y} = 0 \,|\, \mathrm{x} = 1) = q.$$

This example illustrates in its simplest form the effect of a communications channel. Transmitted bits
are hit by noise and what the receiver receives is the noisy (possibly wrong) information. The task of
the receiver is to decide, upon reception of a sequence of symbols, which was the originally transmitted
one.

The goal of our example is to determine the mutual information about the occurrence of $\mathrm{x} = 0$ and
$\mathrm{x} = 1$ once $\mathrm{y} = 0$ has been observed. To this end, we first need to compute the marginal probabilities,

$$P(\mathrm{y} = 0) = P(\mathrm{y} = 0 \,|\, \mathrm{x} = 0)P(\mathrm{x} = 0) + P(\mathrm{y} = 0 \,|\, \mathrm{x} = 1)P(\mathrm{x} = 1) = \frac{1}{2}(1 - p + q),$$

and similarly,

$$P(\mathrm{y} = 1) = \frac{1}{2}(1 - q + p).$$

Thus, the mutual information is

$$I(0; 0) = \log_2 \frac{P(\mathrm{x} = 0 \,|\, \mathrm{y} = 0)}{P(\mathrm{x} = 0)} = \log_2 \frac{P(\mathrm{y} = 0 \,|\, \mathrm{x} = 0)}{P(\mathrm{y} = 0)} = \log_2 \frac{2(1 - p)}{1 - p + q},$$

and

$$I(1; 0) = \log_2 \frac{2q}{1 - p + q}.$$

Let us now consider that $p = q = 0$. Then $I(0; 0) = 1$ bit, which is equal to $I(\mathrm{x} = 0)$, since the output
specifies the input with certainty. If on the other hand $p = q = 1/2$, then $I(0; 0) = 0$ bits, since the noise
can randomly change polarity with equal probability. If now $p = q = 1/4$, then $I(0; 0) = \log_2 \frac{3}{2} = 0.585$ bits
and $I(1; 0) = -1$ bit. Observe that the mutual information can take negative values, too.
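
The numbers of the example can be reproduced with a short sketch that evaluates $\log_2 \big(P(y|x)/P(y)\big)$ for the binary channel. The function name and the `prior0` argument (the probability of transmitting a 0, equal to 1/2 in the example) are illustrative assumptions.

```python
import math


def channel_mutual_information(x, y, p, q, prior0=0.5):
    """I(x; y) = log2(P(y|x) / P(y)) in bits for the binary channel of the example.

    Raises ValueError when P(y|x) = 0, where the quantity is minus infinity.
    """
    prior = {0: prior0, 1: 1.0 - prior0}
    # likelihood[(y, x)] = P(y | x)
    likelihood = {(0, 0): 1.0 - p, (1, 0): p, (1, 1): 1.0 - q, (0, 1): q}
    p_y = sum(likelihood[(y, xv)] * prior[xv] for xv in (0, 1))
    return math.log2(likelihood[(y, x)] / p_y)


for p in (0.0, 0.5, 0.25):
    print(p, channel_mutual_information(0, 0, p, p))
print(channel_mutual_information(1, 0, 0.25, 0.25))
```

The last value is negative: observing $\mathrm{y} = 0$ makes $\mathrm{x} = 1$ less likely than it was a priori.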

Entropy and Average Mutual Information 

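Given a discrete random variable $\mathrm{x} \in \mathcal{X}$, its entropy is defined as the average information over all
possible outcomes,

$$H(\mathrm{x}) := -\sum_{x \in \mathcal{X}} P(x) \log P(x) \;:\; \text{entropy of } \mathrm{x}. \tag{2.149}$$

Note that if $P(x) = 0$, $P(x) \log P(x) = 0$, by taking into consideration that $\lim_{x \to 0} x \log x = 0$.

In a similar way, the average mutual information between two random variables, $\mathrm{x}$, $\mathrm{y}$, is defined as

$$I(\mathrm{x}; \mathrm{y}) := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) I(x; y),$$

or

$$I(\mathrm{x}; \mathrm{y}) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x) P(y)} \;:\; \text{average mutual information}. \tag{2.150}$$

It can be shown that

$$I(\mathrm{x}; \mathrm{y}) \geq 0,$$

and it is zero if $\mathrm{x}$ and $\mathrm{y}$ are statistically independent (Problem 2.12).
In comparison, the conditional entropy of $\mathrm{x}$ given $\mathrm{y}$ is defined as

$$H(\mathrm{x}|\mathrm{y}) := -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log P(x|y) \;:\; \text{conditional entropy}. \tag{2.151}$$

It is readily shown, by taking into account the probability product rule, that

$$I(\mathrm{x}; \mathrm{y}) = H(\mathrm{x}) - H(\mathrm{x}|\mathrm{y}). \tag{2.152}$$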
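
The identity (2.152) can be verified numerically on a small joint probability table. The sketch below uses base 2 logarithms and, as a hypothetical example, the joint distribution of the binary channel with $p = q = 1/4$ and equiprobable inputs; the function names and the `joint[x][y]` layout are illustrative.

```python
import math


def entropy(px):
    """H = -sum P(x) log2 P(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in px if p > 0.0)


def marginals(joint):
    """Return (P(x), P(y)) from a table joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def average_mutual_information(joint):
    px, py = marginals(joint)
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0.0)


def conditional_entropy(joint):
    """H(x|y) = -sum P(x, y) log2 P(x|y), with P(x|y) = P(x, y) / P(y)."""
    _, py = marginals(joint)
    return -sum(pxy * math.log2(pxy / py[j])
                for row in joint
                for j, pxy in enumerate(row) if pxy > 0.0)


joint = [[0.375, 0.125],
         [0.125, 0.375]]
px, _ = marginals(joint)
print(average_mutual_information(joint), entropy(px) - conditional_entropy(joint))
```

The two printed values coincide up to rounding, and the average mutual information is nonnegative even though individual terms $I(x; y)$ may be negative.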


Lemma 2.1. The entropy of a random variable $\mathrm{x} \in \mathcal{X}$ takes its maximum value if all possible values
$x \in \mathcal{X}$ are equiprobable.

Proof. The proof is given in Problem 2.14. □

In other words, the entropy can be considered as a measure of randomness of a source that emits
symbols randomly. The maximum value is associated with the maximum uncertainty of what is going 
to be emitted, since the maximum value occurs if all symbols are equiprobable. The smallest value of 
the entropy is equal to zero, which corresponds to the case where all events have zero probability with 
the exception of one, whose probability to occur is equal to one. 

Example 2.7. Consider a binary source that transmits the values 1 or 0 with probabilities $p$ and $1 - p$,
respectively. Then the entropy of the associated random variable is

$$H(\mathrm{x}) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$

Fig. 2.17 shows the graph for various values of $p \in [0, 1]$. Observe that the maximum value occurs for
$p = 1/2$.



FIGURE 2.17 

The maximum value of the entropy for a binary random variable occurs if the two possible events have equal 
probability, p = 1/2. 
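
The behavior of the binary entropy can be checked by evaluating it on a grid of probabilities. This is a minimal sketch; the function name and the grid resolution are illustrative.

```python
import math


def binary_entropy(p):
    """H = -p log2 p - (1 - p) log2(1 - p) in bits, with 0 log 0 taken as 0."""
    return -sum(t * math.log2(t) for t in (p, 1.0 - p) if t > 0.0)


grid = [i / 100 for i in range(101)]
best = max(grid, key=binary_entropy)
print(best, binary_entropy(best))
```

On this grid the maximizer is the equiprobable case, and the entropy vanishes at both ends of the interval, where the outcome is certain.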


2.5.2 CONTINUOUS RANDOM VARIABLES

All the definitions given before can be generalized to the case of continuous random variables. However,
this generalization must be made with caution. Recall that the probability of occurrence of any
single value of a random variable that takes values in an interval in the real axis is zero. Hence, the
corresponding information content is infinite.

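To define the entropy of a continuous variable $\mathrm{x}$, we first discretize it and form the corresponding
discrete variable $\mathrm{x}_\Delta$, i.e.,

$$\mathrm{x}_\Delta := n\Delta, \quad \text{if } (n - 1)\Delta < \mathrm{x} \leq n\Delta, \tag{2.153}$$

where $\Delta > 0$. Then,

$$P(\mathrm{x}_\Delta = n\Delta) = P(n\Delta - \Delta < \mathrm{x} \leq n\Delta) = \int_{(n-1)\Delta}^{n\Delta} p(x)\,dx = \Delta \bar{p}(n\Delta), \tag{2.154}$$

where $\bar{p}(n\Delta)$ is a number between the maximum and the minimum value of $p(x)$, $x \in (n\Delta - \Delta, n\Delta]$
(such a number exists by the mean value theorem of calculus). Then we can write

$$H(\mathrm{x}_\Delta) = -\sum_{n=-\infty}^{+\infty} \Delta \bar{p}(n\Delta) \log\big(\Delta \bar{p}(n\Delta)\big), \tag{2.155}$$

and since

$$\sum_{n=-\infty}^{+\infty} \Delta \bar{p}(n\Delta) = \int_{-\infty}^{+\infty} p(x)\,dx = 1,$$

we obtain

$$H(\mathrm{x}_\Delta) = -\log \Delta - \sum_{n=-\infty}^{+\infty} \Delta \bar{p}(n\Delta) \log\big(\bar{p}(n\Delta)\big). \tag{2.156}$$

Note that $\mathrm{x}_\Delta \to \mathrm{x}$ as $\Delta \to 0$. However, if we take the limit in Eq. (2.156), then $-\log \Delta$ goes to
infinity. This is the crucial difference compared to the discrete variables.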

The entropy for a continuous random variable $\mathrm{x}$ is defined as the limit

$$H(\mathrm{x}) := \lim_{\Delta \to 0} \big(H(\mathrm{x}_\Delta) + \log \Delta\big),$$

or

$$H(\mathrm{x}) = -\int_{-\infty}^{+\infty} p(x) \log p(x)\,dx \;:\; \text{entropy}. \tag{2.157}$$

This is the reason that the entropy of a continuous variable is also called differential entropy.

Note that the entropy is still a measure of randomness (uncertainty) of the distribution describing $\mathrm{x}$.
This is demonstrated via the following example.
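
The limiting argument can be made concrete with a numerical sketch. Assuming, purely for illustration, a zero-mean Gaussian variable, the bin probabilities in (2.154) are computed exactly from the CDF, and $H(\mathrm{x}_\Delta) + \ln \Delta$ is compared against the well-known closed form $\frac{1}{2}\ln(2\pi e \sigma^2)$ for the Gaussian differential entropy, which is not derived in this section. The function names and the truncation `span` are hypothetical choices.

```python
import math


def gaussian_cdf(x, sigma=1.0):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def discretized_entropy(delta, sigma=1.0, span=12.0):
    """H(x_Delta) in nats for a zero-mean Gaussian with bins ((n-1) Delta, n Delta].

    Bins beyond +/- span * sigma are ignored; their probability is negligible.
    """
    n_max = int(math.ceil(span * sigma / delta))
    h = 0.0
    for n in range(-n_max + 1, n_max + 1):
        prob = gaussian_cdf(n * delta, sigma) - gaussian_cdf((n - 1) * delta, sigma)
        if prob > 0.0:
            h -= prob * math.log(prob)
    return h


target = 0.5 * math.log(2.0 * math.pi * math.e)   # assumed closed form, sigma = 1
for delta in (1.0, 0.5, 0.1):
    h_delta = discretized_entropy(delta)
    print(delta, h_delta, h_delta + math.log(delta), target)
```

As $\Delta$ decreases, $H(\mathrm{x}_\Delta)$ itself keeps growing, whereas $H(\mathrm{x}_\Delta) + \ln \Delta$ settles near the differential entropy.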


Example 2.8. We are given a random variable $\mathrm{x} \in [a, b]$. Of all the possible PDFs that can describe
this variable, find the one that maximizes the entropy.

This task translates to the following constrained optimization task:

$$\text{maximize with respect to } p: \quad H = -\int_a^b p(x) \ln p(x)\,dx,$$
$$\text{subject to:} \quad \int_a^b p(x)\,dx = 1.$$

The constraint guarantees that the function to result is indeed a PDF. Using calculus of variations to
perform the optimization (Problem 2.15), it turns out that

$$p(x) = \begin{cases} \dfrac{1}{b - a}, & \text{if } x \in [a, b], \\ 0, & \text{otherwise.} \end{cases}$$

In other words, the result is the uniform distribution, which is indeed the most random one since it
gives no preference to any particular subinterval of $[a, b]$.

We will come to this method of estimating PDFs in Section 12.8.1. This elegant method for estimating
PDFs comes from Jaynes [3,4], and it is known as the maximum entropy method. In its more
general form, more constraints are involved to fit the needs of the specific problem.
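
The claim of the example can be illustrated, though of course not proved, by comparing the entropy of the uniform PDF with that of another PDF supported on the same interval. The sketch below uses a midpoint-rule integration and a triangular PDF as a hypothetical competitor; the interval, the function names, and the number of integration steps are all illustrative assumptions.

```python
import math


def differential_entropy(pdf, a, b, steps=20_000):
    """Midpoint-rule approximation of -int_a^b p(x) ln p(x) dx, in nats."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(a + (i + 0.5) * h)
        if p > 0.0:                      # 0 ln 0 is taken as 0
            total -= p * math.log(p) * h
    return total


A, B = 0.0, 2.0                          # hypothetical interval [a, b]


def uniform_pdf(x):
    return 1.0 / (B - A)


def triangular_pdf(x):
    """Peaks at the midpoint of [A, B] and integrates to one."""
    half = (B - A) / 2.0
    return (1.0 - abs(x - (A + B) / 2.0) / half) / half


h_uniform = differential_entropy(uniform_pdf, A, B)
h_triangular = differential_entropy(triangular_pdf, A, B)
print(h_uniform, h_triangular)
```

The uniform PDF attains $\ln(b - a)$, and the triangular one, which favors the middle of the interval, yields a smaller value.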

Average Mutual Information and Conditional Information 

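Given two continuous random variables, the average mutual information is defined as

$$I(\mathrm{x}; \mathrm{y}) := \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}\,dx\,dy \tag{2.158}$$

and the conditional entropy of $\mathrm{x}$, given $\mathrm{y}$,

$$H(\mathrm{x}|\mathrm{y}) := -\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} p(x, y) \log p(x|y)\,dx\,dy. \tag{2.159}$$

Using standard arguments and the product rule, it is easy to show that

$$I(\mathrm{x}; \mathrm{y}) = H(\mathrm{x}) - H(\mathrm{x}|\mathrm{y}) = H(\mathrm{y}) - H(\mathrm{y}|\mathrm{x}). \tag{2.160}$$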


Relative Entropy or Kullback-Leibler Divergence 

The relative entropy or Kullback-Leibler divergence is a quantity that has been developed within the
context of information theory for measuring similarity between two PDFs. It is widely used in machine
learning optimization tasks when PDFs are involved; see Chapter 12. Given two PDFs, $p$ and $q$, their
Kullback-Leibler divergence, denoted as $\mathrm{KL}(p\|q)$, is defined as

$$\mathrm{KL}(p\|q) := \int_{-\infty}^{+\infty} p(x) \log \frac{p(x)}{q(x)}\,dx \;:\; \text{Kullback-Leibler divergence}. \tag{2.161}$$

Note that

$$I(\mathrm{x}; \mathrm{y}) = \mathrm{KL}\big(p(x, y)\,\|\,p(x)p(y)\big).$$

The Kullback-Leibler divergence is not symmetric, i.e., $\mathrm{KL}(p\|q) \neq \mathrm{KL}(q\|p)$, and it can be shown
that it is a nonnegative quantity (the proof is similar to the proof that the mutual information is nonnegative;
see Problem 12.7 of Chapter 12). Moreover, it is zero if and only if $p = q$.

Note that all we have said concerning entropy and mutual information is readily generalized to the
case of random vectors.
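
The lack of symmetry and the nonnegativity can be observed on a small example. The sketch below uses the discrete analogue of (2.161), with the integral replaced by a sum over probability masses; this substitution, the function name, and the two distributions are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q):
    """Discrete analogue of Eq. (2.161): sum_i p_i ln(p_i / q_i), in nats."""
    if any(pi > 0.0 and qi == 0.0 for pi, qi in zip(p, q)):
        return math.inf                  # p puts mass where q has none
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
```

The first two values are positive and differ from each other, while the divergence of a distribution from itself is zero.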


2.6 STOCHASTIC CONVERGENCE 

We will close this memory refreshing tour of the theory of probability and related concepts with some
definitions concerning convergence of sequences of random variables.

Consider a sequence of random variables

$$\mathrm{x}_0, \mathrm{x}_1, \ldots, \mathrm{x}_n, \ldots$$

We can consider this sequence as a discrete-time stochastic process. Due to the randomness, a realization
of this process, as shown by

$$x_0, x_1, \ldots, x_n, \ldots,$$

may converge or may not. Thus, the notion of convergence of random variables has to be treated
carefully, and different interpretations have been developed.

Recall from our basic calculus that a sequence of numbers, $x_n$, converges to a value $x$ if $\forall \epsilon > 0$
there exists a number $n(\epsilon)$ such that

$$|x_n - x| < \epsilon, \quad \forall n \geq n(\epsilon). \tag{2.162}$$

CONVERGENCE EVERYWHERE 

We say that a random sequence converges everywhere if every realization, $x_n$, of the random process
converges to a value $x$, according to the definition given in Eq. (2.162). Note that each realization
converges, in general, to a different value, which itself can be considered as the outcome of a random variable $\mathrm{x}$,
and we write

$$\mathrm{x}_n \xrightarrow[n \to \infty]{} \mathrm{x}. \tag{2.163}$$

It is common to denote a realization (outcome) of a random process as $x_n(\zeta)$, where $\zeta$ denotes a specific
experiment.

CONVERGENCE ALMOST EVERYWHERE 

A weaker version of convergence, compared to the previous one, is the convergence almost everywhere.
Consider the set of outcomes $\zeta$, such that

$$\lim x_n(\zeta) = x(\zeta), \quad n \to \infty.$$

We say that the sequence $\mathrm{x}_n$ converges almost everywhere if

$$P(\mathrm{x}_n \to \mathrm{x}) = 1, \quad n \to \infty. \tag{2.164}$$

Note that $\{\mathrm{x}_n \to \mathrm{x}\}$ denotes the event comprising all the outcomes such that $\lim x_n(\zeta) = x(\zeta)$. The
difference with the convergence everywhere is that now a finite or countably infinite
number of realizations (that is, a set of zero probability) is allowed not to converge. Often, this type of convergence
is referred to as almost sure convergence or convergence with probability 1.


CONVERGENCE IN THE MEAN-SQUARE SENSE 

We say that a random sequence $\mathrm{x}_n$ converges to the random variable $\mathrm{x}$ in the mean-square sense if

$$\mathbb{E}\big[|\mathrm{x}_n - \mathrm{x}|^2\big] \to 0, \quad n \to \infty. \tag{2.165}$$

CONVERGENCE IN PROBABILITY

Given a random sequence $\mathrm{x}_n$, a random variable $\mathrm{x}$, and a nonnegative number $\epsilon$, then $\{|\mathrm{x}_n - \mathrm{x}| > \epsilon\}$ is
an event. We define the new sequence of numbers, $P(\{|\mathrm{x}_n - \mathrm{x}| > \epsilon\})$. We say that $\mathrm{x}_n$ converges to $\mathrm{x}$ in
probability if the constructed sequence of numbers tends to zero,

$$P(\{|\mathrm{x}_n - \mathrm{x}| > \epsilon\}) \to 0, \quad n \to \infty, \quad \forall \epsilon > 0. \tag{2.166}$$

CONVERGENCE IN DISTRIBUTION 

Given a random sequence $\mathrm{x}_n$ and a random variable $\mathrm{x}$, let $F_n(x)$ and $F(x)$ be the CDFs, respectively.
We say that $\mathrm{x}_n$ converges to $\mathrm{x}$ in distribution if

$$F_n(x) \to F(x), \quad n \to \infty, \tag{2.167}$$

for every point $x$ of continuity of $F(x)$.

It can be shown that if a random sequence converges either almost everywhere or in the mean-square
sense, then it necessarily converges in probability, and if it converges in probability, then it necessarily
converges in distribution. The converse arguments are not necessarily true. In other words, the weakest
version of convergence is that of convergence in distribution.
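
The quantities entering (2.165) and (2.166) can be estimated by simulation. As a hypothetical example, the sketch below takes $\mathrm{x}_n$ to be the sample mean of $n$ i.i.d. variables uniform in $[0, 1]$ and the limit to be the constant 0.5, a degenerate random variable; the function name, the number of trials, and the threshold `eps` are illustrative assumptions.

```python
import random


def sample_mean_deviation(n, trials=1000, eps=0.05, seed=0):
    """Monte Carlo estimates of E[|m_n - mu|^2] and P(|m_n - mu| > eps),
    where m_n is the mean of n i.i.d. samples uniform in [0, 1] and mu = 0.5."""
    rng = random.Random(seed)
    sq_err, exceed = 0.0, 0
    for _ in range(trials):
        m = sum(rng.random() for _ in range(n)) / n
        sq_err += (m - 0.5) ** 2
        exceed += abs(m - 0.5) > eps
    return sq_err / trials, exceed / trials


results = {n: sample_mean_deviation(n) for n in (10, 100, 1000)}
for n in (10, 100, 1000):
    mse, prob = results[n]
    print(n, mse, prob)
```

Both estimated sequences shrink as $n$ grows, in line with mean-square convergence implying convergence in probability.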


PROBLEMS 

2.1 Derive the mean and variance for the binomial distribution. 

2.2 Derive the mean and variance for the uniform distribution. 

2.3 Derive the mean and covariance matrix of the multivariate Gaussian. 

2.4 Show that the mean and variance of the beta distribution with parameters $a$ and $b$ are given by

$$\mathbb{E}[\mathrm{x}] = \frac{a}{a + b}$$

and

$$\sigma_x^2 = \frac{ab}{(a + b)^2 (a + b + 1)}.$$

Hint: Use the property $\Gamma(a + 1) = a\Gamma(a)$.

2.5 Show that the normalizing constant in the beta distribution with parameters $a$, $b$ is given by

$$\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}.$$

2.6 Show that the mean and variance of the gamma PDF

$$\mathrm{Gamma}(x|a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}, \quad a, b, x > 0,$$

are given by

$$\mathbb{E}[\mathrm{x}] = \frac{a}{b}, \quad \sigma_x^2 = \frac{a}{b^2}.$$

2.7 Show that the mean and variance of a Dirichlet PDF with $K$ variables $\mathrm{x}_k$, $k = 1, 2, \ldots, K$, and
parameters $a_k$, $k = 1, 2, \ldots, K$, are given by

$$\mathbb{E}[\mathrm{x}_k] = \frac{a_k}{\bar{a}}, \quad k = 1, 2, \ldots, K,$$
$$\sigma_{x_k}^2 = \frac{a_k(\bar{a} - a_k)}{\bar{a}^2 (1 + \bar{a})}, \quad k = 1, 2, \ldots, K,$$
$$\mathrm{cov}[\mathrm{x}_i \mathrm{x}_j] = -\frac{a_i a_j}{\bar{a}^2 (1 + \bar{a})}, \quad i \neq j,$$

where $\bar{a} = \sum_{k=1}^{K} a_k$.

2.8 Show that the sample mean, using $N$ i.i.d. drawn samples, is an unbiased estimator with variance
that tends to zero asymptotically, as $N \to \infty$.

2.9 Show that for WSS processes

$$r(0) \geq |r(k)|, \quad \forall k \in \mathbb{Z},$$

and that for jointly WSS processes

$$r_u(0) r_v(0) \geq |r_{uv}(k)|^2, \quad \forall k \in \mathbb{Z}.$$

2.10 Show that the autocorrelation of the output of a linear system, with impulse response $w_n$, $n \in \mathbb{Z}$,
is related to the autocorrelation of the input WSS process via

$$r_d(k) = r_u(k) * w_k * w^*_{-k}.$$

2.11 Show that

$$\ln x \leq x - 1.$$

2.12 Show that

$$I(\mathrm{x}; \mathrm{y}) \geq 0.$$

Hint: Use the inequality of Problem 2.11.

2.13 Show that if $a_i$, $b_i$, $i = 1, 2, \ldots, M$, are positive numbers such that

$$\sum_{i=1}^{M} a_i = 1 \quad \text{and} \quad \sum_{i=1}^{M} b_i \leq 1,$$

then

$$-\sum_{i=1}^{M} a_i \ln a_i \leq -\sum_{i=1}^{M} a_i \ln b_i.$$

2.14 Show that the maximum value of the entropy of a random variable occurs if all possible outcomes 
are equiprobable. 

2.15 Show that from all the PDFs that describe a random variable in an interval [a, b], the uniform
one maximizes the entropy.

3.1 INTRODUCTION

Parametric modeling is a theme that runs across the spine of this book. A number of chapters focus 
on different aspects of this important problem. This chapter provides basic definitions and concepts 
related to the task of learning when parametric models are mobilized to describe the available data.

As has already been pointed out in the introductory chapter, a large class of machine learning
problems ends up as being equivalent to a function estimation/approximation task. The function is
“learned” during the learning/training phase by digging in the information that resides in the available 
training data set. This function relates the so-called input variables to the output variable(s). Once this 
functional relationship is established, one can in turn exploit it to predict the value(s) of the output(s), 
based on measurements from the respective input variables; these predictions can then be used to 
proceed to the decision making phase. 

In parametric modeling, the aforementioned functional dependence that relates the input to the
output is defined via a set of parameters, whose number is fixed and a priori known. The values of the
parameters are unknown and have to be estimated based on the available input-output observations. In
contrast to the parametric, there are the so-called nonparametric methods. In such methods, parameters
may still be involved to establish the input-output relationship, yet their number is not fixed; it depends
on the size of the data set and it grows with the number of observations. Nonparametric methods will 
also be treated in this book (e.g., Chapters 11 and 13). However, the emphasis in the current chapter 
lies on parametric models. 

There are two possible paths to deal with the uncertainty imposed by the unknown values of the 
involved parameters. According to the first one, parameters are treated as deterministic nonrandom 
variables. The task of learning is to obtain estimates of their unknown values. For each one of the 
parameters a single value estimate is obtained. The other approach has a stronger statistical flavor. The 
unknown parameters are treated as random variables and the task of learning is to infer the associated 
probability distributions. Once the distributions have been learned/inferred, one can use them to make 
predictions. Both approaches are introduced in the current chapter and are treated in more detail later 
on in various chapters of the book. 

Two of the major machine learning tasks, namely, regression and classification, are presented and 
the main directions in dealing with these problems are exposed. Various issues that are related to 
the parameter estimation task, such as estimator efficiency, bias-variance dilemma, overfitting, and 
the curse of dimensionality, are introduced and discussed. The chapter can also be considered as a 
road map to the rest of the book. However, instead of just presenting the main ideas and directions 
in a rather “dry” way, we chose to deal and work with the involved tasks by adopting simple models 
and techniques, so that the reader gets a better feeling of the topic. An effort was made to pay more 
attention to the scientific notions than to algebraic manipulations and mathematical details, which will,
unavoidably, be used to a larger extent while “embroidering” the chapters to follow. 

The least-squares (LS), maximum likelihood (ML), regularization, and Bayesian inference techniques
are presented and discussed. An effort has been made to assist the reader to grasp an informative
view of the big picture conveyed by the book. Thus, this chapter could also be used as an overview 
introduction to the parametric modeling task in the realm of machine learning. 


3.2 PARAMETER ESTIMATION: THE DETERMINISTIC POINT OF VIEW 

The task of estimating the value of an unknown parameter vector, $\boldsymbol{\theta}$, has been at the center of interest
in a number of application areas. For example, in the early years at university, one of the very first
subjects any student has to study is the so-called curve fitting problem. Given a set of data points, one
must find a curve or a surface that “fits” the data. The usual path to follow is to adopt a functional form,
such as a linear function or a quadratic one, and try to estimate the associated unknown parameters so
that the graph of the function “passes through” the data and follows their deployment in space as close
as possible. Figs. 3.1A and B are two such examples. The data lie in the $\mathbb{R}^2$ space and are given to us
as a set of points $(y_n, x_n)$, $n = 1, 2, \ldots, N$. The adopted functional form for the curve corresponding to
Fig. 3.1A is

$$y = f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x, \tag{3.1}$$

and for the case of Fig. 3.1B it is

$$y = f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x + \theta_2 x^2. \tag{3.2}$$

The unknown parameter vectors are $\boldsymbol{\theta} = [\theta_0, \theta_1]^T$ and $\boldsymbol{\theta} = [\theta_0, \theta_1, \theta_2]^T$, respectively. In both cases, the
parameter values, which define the curves drawn by the red lines, provide a much better fit compared
to the values associated with the black ones. In both cases, the task comprises two steps: (a) first adopt
a specific parametric functional form which we reckon to be more appropriate for the data at hand and
(b) estimate the values of the unknown parameters in order to obtain a “good” fit.






FIGURE 3.1 

(A) Fitting a linear function. (B) Fitting a quadratic function. The red lines are the optimized ones. 

In the more general and formal setting, the task can be defined as follows. Given a set of data points,
$(y_n, \boldsymbol{x}_n)$, $y_n \in \mathbb{R}$, $\boldsymbol{x}_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, and a parametric¹ set of functions,

$$\mathcal{F} := \big\{ f_{\boldsymbol{\theta}}(\cdot): \boldsymbol{\theta} \in \mathcal{A} \subseteq \mathbb{R}^K \big\}, \tag{3.3}$$

find a function in $\mathcal{F}$, which will be denoted as $f(\cdot) := f_{\boldsymbol{\theta}_*}(\cdot)$, such that given a value of $\boldsymbol{x} \in \mathbb{R}^l$, $f(\boldsymbol{x})$
best approximates the corresponding value $y \in \mathbb{R}$. The set $\mathcal{A}$ is a constraint set in case we wish to constrain
the unknown $K$ parameters to lie within a specific region in $\mathbb{R}^K$. Constraining the parameters to
be within a subset of the parameter space is almost omnipresent in machine learning. We will deal
with constrained optimization later on in this chapter (Section 3.8). We start our discussion by considering
$y$ to be a real variable, $y \in \mathbb{R}$, and as we move on and understand better the various “secrets,”
we will allow it to move to higher-dimensional spaces. The value $\boldsymbol{\theta}_*$ is the value that results from an
estimation procedure. The values of $\boldsymbol{\theta}_*$ that define the red curves in Figs. 3.1A and B are

$$\boldsymbol{\theta}_* = [-0.5, 1]^T, \quad \boldsymbol{\theta}_* = [-3, -2, 1]^T, \tag{3.4}$$

respectively.

¹ Recall from the Notations section that we use the symbol $f(\cdot)$ to denote a function of a single argument; $f(x)$ denotes the
value that the function $f(\cdot)$ has at $x$.

To reach a decision with respect to the choice of $\mathcal{F}$ is not an easy task. For the case of the data in
Fig. 3.1, we were a bit “lucky.” First, the data live in the two-dimensional space, where we have the
luxury of visualization. Second, the data were scattered along curves whose shape is pretty familiar
to us; hence, a simple inspection suggested the proper family of functions, for each one of the two
cases. Obviously, real life is hardly as generous as that and in the majority of practical applications,
the data reside in high-dimensional spaces and/or the shape of the surface (hypersurface, for spaces
of dimensionality higher than three) can be quite complex. Hence, the choice of $\mathcal{F}$, which dictates the
functional form (e.g., linear, quadratic, etc.), is not easy. In practice, one has to use as much a priori
information as possible concerning the physical mechanism that underlies the generation of the data,
and most often use different families of functions and finally keep the one that results in the best
performance, according to a chosen criterion.

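Having adopted a parametric family of functions, $\mathcal{F}$, one has to get estimates for the set of the
unknown parameters. To this end, a measure of fitness has to be adopted. The more classical approach
is to adopt a loss function, which quantifies the deviation/error between the measured value of $y$ and
the predicted one using the corresponding measurements $\boldsymbol{x}$, as in $f_{\boldsymbol{\theta}}(\boldsymbol{x})$. In a more formal way, we
adopt a nonnegative (loss) function,

$$\mathcal{L}(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \longmapsto [0, \infty),$$

and compute $\boldsymbol{\theta}_*$ so as to minimize the total loss, or as we say the cost, over all the data points, or

$$f(\cdot) := f_{\boldsymbol{\theta}_*}(\cdot): \quad \boldsymbol{\theta}_* = \arg\min_{\boldsymbol{\theta} \in \mathcal{A}} J(\boldsymbol{\theta}), \tag{3.5}$$

where

$$J(\boldsymbol{\theta}) := \sum_{n=1}^{N} \mathcal{L}\big(y_n, f_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big), \tag{3.6}$$

assuming that a minimum exists. Note that, in general, there may be more than one optimal value $\boldsymbol{\theta}_*$,
depending on the shape of $J(\boldsymbol{\theta})$.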
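
As a small illustration of Eq. (3.6), the following sketch evaluates the cost for one possible choice of loss, the squared error, and a linear family of functions. The function name, the toy data, and the convention of appending a constant 1 to each input vector are assumptions made for this example only.

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """Cost J(theta) = sum_n (y_n - theta^T x_n)^2 for a linear model."""
    residuals = y - X @ theta
    return float(residuals @ residuals)

# Hypothetical toy data: three points on the line y = -0.5 + x.
# Each row of X is [x, 1]; the constant 1 carries the intercept.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-0.5, 0.5, 1.5])

cost_true = squared_error_cost(np.array([1.0, -0.5]), X, y)  # parameters that generated the data
cost_zero = squared_error_cost(np.array([0.0, 0.0]), X, y)   # a poor guess
print(cost_true, cost_zero)
```

The parameters that generated the data give a zero cost, while any other parameter vector gives a strictly larger value for this data set.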

As the book evolves, we are going to see different loss functions and different parametric families
of functions. For the sake of simplicity, for the rest of this chapter we will adhere to the squared error
loss function,

$$\mathcal{L}\big(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\big) = \big(y - f_{\boldsymbol{\theta}}(\boldsymbol{x})\big)^2,$$

and to the linear class of functions.

The squared error loss function is credited to the great mathematician Carl Friedrich Gauss, who
proposed the fundamentals of the least-squares (LS) method in 1795 at the age of 18. However, it
was Adrien-Marie Legendre who first published the method in 1805, working independently. Gauss
published it in 1809. The strength of the method was demonstrated when it was used to predict the
location of the asteroid Ceres. Since then, the squared error loss function has “haunted” all scientific
fields, and even if it is not used directly, it is, most often, used as the standard against which the
performance of more modern alternatives is compared. This success is due to some nice properties
that this loss criterion has, which will be explored as we move on in this book.

The combined choice of linearity with the squared error loss function turns out to simplify the
algebra and hence becomes very pedagogic for introducing the newcomer to the various “secrets”
that underlie the area of parameter estimation. Moreover, understanding linearity is very important.
Nonlinear tasks, most often, are finally reduced to a linear problem. Take, for example,
the nonlinear model in Eq. (3.2) and consider the transformation

$$\mathbb{R} \ni x \longmapsto \boldsymbol{\phi}(x) := \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \end{bmatrix} = \begin{bmatrix} x \\ x^2 \end{bmatrix} \in \mathbb{R}^2. \tag{3.7}$$

Then Eq. (3.2) becomes

$$y = \theta_0 + \theta_1 \phi_1(x) + \theta_2 \phi_2(x). \tag{3.8}$$

That is, the model is now linear with respect to the components $\phi_k(x)$, $k = 1, 2$, of the two-dimensional
image $\boldsymbol{\phi}(x)$ of $x$. As a matter of fact, this simple trick is at the heart of a number of nonlinear methods
that will be treated later on in the book. No doubt, the procedure can be generalized to any number $K$
of functions $\phi_k(x)$, $k = 1, 2, \ldots, K$, and besides monomials, other types of nonlinear functions can be
used, such as exponentials, splines, and wavelets. In spite of the nonlinear nature of the input-output
dependence modeling, we still consider this model to be linear, because it retains its linearity with
respect to the involved unknown parameters $\theta_k$, $k = 1, 2, \ldots, K$. Although for the rest of the chapter
we will adhere to linear functions, in order to keep our discussion simple, everything that will be said
applies to nonlinear ones as well. All that is needed is to replace $\boldsymbol{x}$ with
$\boldsymbol{\phi}(\boldsymbol{x}) := [\phi_1(\boldsymbol{x}), \ldots, \phi_K(\boldsymbol{x})]^T \in \mathbb{R}^K$.
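
The following sketch illustrates the trick of Eqs. (3.7) and (3.8) numerically, assuming a quadratic input-output dependence. The helper `phi`, the coefficient values, and the sampling grid are illustrative choices, and NumPy's least-squares solver stands in for whatever estimation method one prefers.

```python
import numpy as np

def phi(x):
    """Map scalar inputs x to the two-dimensional image [x, x^2]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x**2], axis=-1)

# Noise-free samples of a quadratic; the coefficients are illustrative
theta0, theta1, theta2 = -3.0, -2.0, 1.0
x = np.linspace(-2.0, 4.0, 9)
y = theta0 + theta1 * x + theta2 * x**2

# Rows of Phi are [phi_1(x), phi_2(x), 1]: the model is linear in the parameters
Phi = np.column_stack([phi(x), np.ones_like(x)])
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)  # ordered as [theta1, theta2, theta0]
```

Although $y$ depends nonlinearly on $x$, the fit is an ordinary linear problem in the transformed inputs, and with noise-free data the generating coefficients are recovered up to rounding.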

In the sequel, we will present two examples in order to demonstrate the use of parametric modeling. 
These examples are generic and can represent a wide class of problems. 


3.3 LINEAR REGRESSION 

In statistics, the term regression was coined to define the task of modeling the relationship of a
dependent random variable $\mathrm{y}$, which is considered to be the response of a system, when this is activated by
a set of random variables $\mathrm{x}_1, \mathrm{x}_2, \ldots, \mathrm{x}_l$, which will be represented as the components of an equivalent
random vector $\mathbf{x}$. The relationship is modeled via an additive disturbance or noise term $\eta$. The block
diagram of the process, which relates the involved variables, is given in Fig. 3.2. The noise variable $\eta$
is an unobserved random variable. The goal of the regression task is to estimate the parameter vector $\boldsymbol{\theta}$,
given a set of measurements/observations $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, that we have at our disposal. This
is also known as the training data set. The dependent variable is usually known as the output variable
and the vector $\mathbf{x}$ as the input vector or the regressor. If we model the system as a linear combiner, the
dependence relationship is written as

$$\mathrm{y} = \theta_0 + \theta_1 \mathrm{x}_1 + \cdots + \theta_l \mathrm{x}_l + \eta = \theta_0 + \boldsymbol{\theta}^T \mathbf{x} + \eta. \tag{3.9}$$

The parameter $\theta_0$ is known as the bias or the intercept. Usually, this term is absorbed by the parameter
vector $\boldsymbol{\theta}$ with a simultaneous increase in the dimension of $\mathbf{x}$ by adding the constant 1 as its last element.
Indeed, we can write

$$\theta_0 + \boldsymbol{\theta}^T \mathbf{x} + \eta = [\boldsymbol{\theta}^T, \theta_0] \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} + \eta.$$

From now on, the regression model will be written as

$$\mathrm{y} = \boldsymbol{\theta}^T \mathbf{x} + \eta, \tag{3.10}$$

and, unless otherwise stated, this notation means that the bias term has been absorbed by $\boldsymbol{\theta}$ and $\mathbf{x}$ has
been extended by adding 1 as an extra component. Because the noise variable is unobserved, we need
a model to be able to predict the output value of $\mathrm{y}$, given an observed value, $\boldsymbol{x}$, of the random vector $\mathbf{x}$.
In linear regression, given an estimate $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$, we adopt the following prediction model:

$$\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \cdots + \hat{\theta}_l x_l := \hat{\boldsymbol{\theta}}^T \boldsymbol{x}. \tag{3.11}$$

Using the squared error loss function, the estimate $\hat{\boldsymbol{\theta}}$ is set equal to $\boldsymbol{\theta}_*$, which minimizes the square
difference between $y_n$ and $\hat{y}_n$ over the set of the available observations; that is, one minimizes, with
respect to $\boldsymbol{\theta}$, the cost function

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \left(y_n - \boldsymbol{\theta}^T \boldsymbol{x}_n\right)^2. \tag{3.12}$$

We start our discussion by considering no constraints; hence, we set $\mathcal{A} = \mathbb{R}^K$, and we are going to
search for solutions that lie anywhere in $\mathbb{R}^K$. Taking the derivative (gradient) with respect to $\boldsymbol{\theta}$ (see
Appendix A) and equating to the zero vector, $\boldsymbol{0}$, we obtain (Problem 3.1)

$$\left(\sum_{n=1}^{N} \boldsymbol{x}_n \boldsymbol{x}_n^T\right) \hat{\boldsymbol{\theta}} = \sum_{n=1}^{N} y_n \boldsymbol{x}_n. \tag{3.13}$$



FIGURE 3.2

Block diagram showing the input-output relation in a regression model.

Note that the sum on the left-hand side is an $(l + 1) \times (l + 1)$ matrix, being the sum of $N$ outer vector
products, i.e., $\boldsymbol{x}_n \boldsymbol{x}_n^T$. As we know from linear algebra, we need at least $N = l + 1$ observation vectors
to guarantee that the matrix is invertible, assuming of course linear independence among the vectors
(see, e.g., [35]).
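
The requirement on the number of observations can be checked numerically. The sketch below, under the assumption of randomly drawn inputs and an illustrative dimension $l = 3$, accumulates the sum of outer products and reports its rank for a few values of $N$; the helper name `outer_product_sum` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 3  # number of input variables (illustrative)

def outer_product_sum(X):
    """Accumulate sum_n x_n x_n^T over the rows x_n^T of X."""
    return sum(np.outer(x_n, x_n) for x_n in X)

def extended_inputs(N):
    """Draw N random input vectors and append the constant 1 to each."""
    return np.column_stack([rng.uniform(0.0, 10.0, size=(N, l)), np.ones(N)])

for N in (l, l + 1, 10):
    S = outer_product_sum(extended_inputs(N))
    # The rank can never exceed min(N, l + 1)
    print(N, S.shape, np.linalg.matrix_rank(S))
```

With fewer than $l + 1$ vectors the matrix is necessarily rank deficient; with $l + 1$ or more linearly independent vectors it reaches full rank and can be inverted.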

For those who are not (yet) very familiar with working with vectors, note that Eq. (3.13) is just
a generalization of the scalar case. For example, in a “scalar” world, the input-output pairs would
comprise scalars $(y_n, x_n)$ and the unknown parameter would also be a scalar, $\theta$. The cost function
would be $\sum_{n=1}^{N} (y_n - \theta x_n)^2$. Taking the derivative and setting it equal to zero leads to

$$\hat{\theta} = \frac{\sum_{n=1}^{N} y_n x_n}{\sum_{n=1}^{N} x_n^2}.$$


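A more popular way to write Eq. (3.13) is via the so-called input matrix $X$, defined as the
$N \times (l + 1)$ matrix which has as rows the (extended) regressor vectors $\boldsymbol{x}_n^T$, $n = 1, 2, \ldots, N$, expressed as

$$X := \begin{bmatrix} \boldsymbol{x}_1^T \\ \boldsymbol{x}_2^T \\ \vdots \\ \boldsymbol{x}_N^T \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1l} & 1 \\ x_{21} & \cdots & x_{2l} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{N1} & \cdots & x_{Nl} & 1 \end{bmatrix}. \tag{3.14}$$

Then it is straightforward to see that Eq. (3.13) can be written as

$$(X^T X)\hat{\boldsymbol{\theta}} = X^T \boldsymbol{y}, \tag{3.15}$$

where

$$\boldsymbol{y} := [y_1, y_2, \ldots, y_N]^T. \tag{3.16}$$

Indeed,

$$X^T X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N] \begin{bmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_N^T \end{bmatrix} = \sum_{n=1}^{N} \boldsymbol{x}_n \boldsymbol{x}_n^T,$$

and similarly,

$$X^T \boldsymbol{y} = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N] \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \sum_{n=1}^{N} y_n \boldsymbol{x}_n.$$

Thus, finally, the LS estimate is given by

$$\hat{\boldsymbol{\theta}} = (X^T X)^{-1} X^T \boldsymbol{y}: \quad \text{the LS estimate}, \tag{3.17}$$

assuming, of course, that $(X^T X)^{-1}$ exists.

FIGURE 3.3

The squared error loss function has a unique minimum at the point $\boldsymbol{\theta}_*$.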

In other words, the obtained estimate of the parameter vector is given by a linear set of equations.
This is a major advantage of the squared error loss function, when applied to a linear model. Moreover,
this solution is unique, provided that the $(l + 1) \times (l + 1)$ matrix $X^T X$ is invertible. The uniqueness is
due to the parabolic shape of the graph of the sum of squared errors cost function. This is illustrated
in Fig. 3.3 for the two-dimensional space. It is readily observed that the graph has a unique minimum.
This is a consequence of the fact that, when $X^T X$ is invertible, the sum of squared errors cost function
is a strictly convex one. Issues related to convexity of loss functions will be treated in more detail in
Chapter 8.
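
A minimal sketch of Eq. (3.17) follows, assuming randomly generated data for illustration. It solves the normal equations of Eq. (3.15) with a linear solver rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation requires; the function name `ls_estimate` is hypothetical.

```python
import numpy as np

def ls_estimate(X, y):
    """Solve the normal equations (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)
N, l = 20, 2
X = np.column_stack([rng.normal(size=(N, l)), np.ones(N)])  # constant 1 as last column
y = rng.normal(size=N)

theta_hat = ls_estimate(X, y)
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least-squares solver, for comparison

# The gradient of the cost, -2 X^T (y - X theta), vanishes at the LS solution
gradient = -2.0 * X.T @ (y - X @ theta_hat)
print(np.allclose(theta_hat, theta_ref), np.allclose(gradient, 0.0))
```

Both checks agree because, when $X^T X$ is invertible, the normal equations have exactly one solution and the library solver minimizes the same cost.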

Example 3.1. Consider the system that is described by the following model:

$$\mathrm{y} = \theta_0 + \theta_1 \mathrm{x}_1 + \theta_2 \mathrm{x}_2 + \eta := [0.25, -0.25, 0.25] \begin{bmatrix} \mathrm{x}_1 \\ \mathrm{x}_2 \\ 1 \end{bmatrix} + \eta, \tag{3.18}$$

where $\eta$ is a Gaussian random variable of zero mean and variance $\sigma^2 = 1$. Observe that the generated
data are spread, due to the noise, around the plane that is defined as

$$f(\boldsymbol{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \tag{3.19}$$

in the two-dimensional space (see Fig. 3.4C).

The random variables $\mathrm{x}_1$ and $\mathrm{x}_2$ are assumed to be mutually independent and uniformly distributed
over the interval $[0, 10]$. Furthermore, both variables are independent of the noise variable $\eta$. We
generate $N = 50$ i.i.d.² points for each of the three random variables, i.e., $\mathrm{x}_1$, $\mathrm{x}_2$, $\eta$. For each triplet, we use
Eq. (3.18) to generate the corresponding value, $y$, of $\mathrm{y}$. In this way, the points $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, 50$,
are generated, where each observation $\boldsymbol{x}_n$ lies in $\mathbb{R}^2$. These are used as the training points to obtain the
LS estimates of the parameters of the linear model

$$\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \hat{\theta}_2 x_2. \tag{3.20}$$

Then we repeat the experiments with $\sigma^2 = 10$. Note that, in general, Eq. (3.20) defines a different plane
than the original (3.19).

The values of the LS optimal estimates are obtained by solving a $3 \times 3$ linear system of equations.
They are

(a) $\hat{\theta}_0 = 0.028$, $\hat{\theta}_1 = 0.226$, $\hat{\theta}_2 = -0.224$,

(b) $\hat{\theta}_0 = 0.914$, $\hat{\theta}_1 = 0.325$, $\hat{\theta}_2 = -0.477$,

for the two cases, respectively. Figs. 3.4A and B show the recovered planes. Observe that in the case
of Fig. 3.4A, corresponding to a noise variable of small variance, the obtained plane follows the data
points much closer than that of Fig. 3.4B.

² Independent and identically distributed samples (see also Section 2.3.2, following Eq. (2.84)).
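
The experiment of Example 3.1 can be sketched as follows. The random seed and the function name `run_experiment` are assumptions of this sketch, so the printed estimates will differ from the values (a) and (b) quoted in the example, which come from a different random draw.

```python
import numpy as np

def run_experiment(noise_var, N=50, seed=0):
    """Generate data from the model of Eq. (3.18); return the LS estimate [theta1, theta2, theta0]."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([0.25, -0.25, 0.25])         # ordered as [theta1, theta2, theta0]
    x = rng.uniform(0.0, 10.0, size=(N, 2))            # x1, x2 independent, uniform on [0, 10]
    eta = rng.normal(0.0, np.sqrt(noise_var), size=N)  # zero-mean Gaussian noise
    X = np.column_stack([x, np.ones(N)])               # extended input matrix
    y = X @ theta_true + eta
    return np.linalg.solve(X.T @ X, X.T @ y)           # 3 x 3 linear system

for noise_var in (1.0, 10.0):
    print(noise_var, run_experiment(noise_var))
```

Repeating the call with different seeds gives a feel for how much the estimates scatter around the true coefficients at each noise level.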

Remarks 3.1. 

• The set of points $(\hat{y}_n, x_{n1}, \ldots, x_{nl})$, $n = 1, 2, \ldots, N$, lie on a hyperplane in the $\mathbb{R}^{l+1}$ space.
Equivalently, they lie on a hyperplane that crosses the origin and, thus, it is a linear subspace in the extended
space $\mathbb{R}^{l+2}$ when one absorbs $\theta_0$ in $\boldsymbol{\theta}$, as explained previously.

• Note that the prediction model in Eq. (3.11) could still be used, even if the true system structure does
not obey the linear model in Eq. (3.9). For example, the true dependence between $\mathrm{y}$ and $\mathbf{x}$ may be a
nonlinear one. In such a case, the predictions of $\mathrm{y}$ based on the model in Eq. (3.11) may, however,
not be satisfactory. It all depends on the deviation of our adopted model from the true structure of
the system that generates the data.

• The prediction performance of the model also depends on the statistical properties of the noise 
variable. This is an important issue. We will see later on that, depending on the statistical properties 
of the noise variable, some loss functions and methods may be more suitable than others. 

• The two previous remarks suggest that in order to quantify the performance of an estimator some 
related criteria are necessary. In Section 3.9, we will present some theoretical touches that shed light 
on certain aspects related to the performance of an estimator. 


FIGURE 3.4

Fitting a plane using the LS method for (A) a low variance and (B) a high variance noise case. In (C), the true plane
that corresponds to the true coefficients is shown for comparison reasons. Note that when the noise variance in the
regression model is smaller, a better fit to the data set is obtained.


3.4 CLASSIFICATION

Classification is the task of predicting the class to which an object, known as pattern, belongs. The
pattern is assumed to belong to one and only one among a number of a priori known classes. Each
pattern is uniquely represented by a set of values, known as features. One of the early stages in
designing a classification system is to select an appropriate set of feature variables. These should “encode”
as much class-discriminatory information as possible so that, by measuring their value for a given pattern, we are
able to predict, with high enough probability, the class of the pattern. Selecting the appropriate set of
features for each problem is not an easy task, and it comprises one of the most important areas within
the field of pattern recognition (see, e.g., [12,37]). Having selected, say, $l$ feature (random) variables,
$\mathrm{x}_1, \mathrm{x}_2, \ldots, \mathrm{x}_l$, we stack them as the components of the so-called feature vector $\mathbf{x} \in \mathbb{R}^l$. The goal is to
design a classifier,³ such as a function $f(\boldsymbol{x})$, or equivalently a decision surface, $f(\boldsymbol{x}) = 0$, in $\mathbb{R}^l$, so
that given the values in a feature vector $\boldsymbol{x}$, which corresponds to a pattern, we will be able to predict
the class to which the pattern belongs.

³ In the more general case, a set of functions.

To formulate the task in mathematical terms, each class is represented by the class label variable $y$.
For the simple two-class classification task, this can take either of two values, depending on the class
(e.g., $1, -1$ or $1, 0$, etc.). Then, given the value of $\boldsymbol{x}$, corresponding to a specific pattern, its class label
is predicted according to the rule

$$\hat{y} = \phi(f(\boldsymbol{x})),$$

where $\phi(\cdot)$ is a nonlinear function that indicates on which side of the decision surface $f(\boldsymbol{x}) = 0$ lies
$\boldsymbol{x}$. For example, if the class labels are $\pm 1$, the nonlinear function is chosen to be the sign function, or
$\phi(\cdot) = \operatorname{sgn}(\cdot)$. It is now clear that what we have said so far in the previous section can be transferred
here and the task becomes that of estimating a function $f(\cdot)$, based on a set of training points
$(y_n, \boldsymbol{x}_n) \in D \times \mathbb{R}^l$, $n = 1, 2, \ldots, N$, where $D$ denotes the discrete set in which $y$ lies. Function $f(\cdot)$ is selected so
as to belong in a specific parametric class of functions, $\mathcal{F}$, and the goal is, once more, to estimate the
parameters so that the deviation between the true class labels, $y_n$, and the predicted ones, $\hat{y}_n$, is minimum
according to a preselected criterion. So, is the classification any different from the regression task?

The answer to the previous question is that although they may share some similarities, they are 
different. Note that in a classification task, the dependent variables are of a discrete nature, in contrast 
to the regression, where they lie in an interval. This suggests that, in general, different techniques have 
to be adopted to optimize the parameters. For example, the most obvious choice for a criterion in a 
classification task is the probability of error. However, in a number of cases, one could attack both 
tasks using the same type of loss functions, as we will do in this section; even if such an approach is 
adopted, in spite of the similarities in their mathematical formalism, the goals of the two tasks remain 
different. 

In the regression task, the function $f(\cdot)$ has to “explain” the data generation mechanism. The
corresponding surface in the $(y, \boldsymbol{x})$ space $\mathbb{R}^{l+1}$ should develop so as to follow the spread of the data in the
space as closely as possible. In contrast, in classification, the goal is to place the corresponding surface
$f(\boldsymbol{x}) = 0$ in $\mathbb{R}^l$, so as to separate the data that belong to different classes as much as possible. The
goal of a classifier is to partition the space where the feature vectors lie into regions and associate each 
region with a class. Fig. 3.5 illustrates two cases of classification tasks. The first one is an example of 
two linearly separable classes, where a straight line can separate the two classes; the second case is an 
example of two nonlinearly separable classes, where the use of a linear classifier would have failed to 
separate the two classes. 

FIGURE 3.5

Examples of two-class classification tasks. (A) A linearly separable classification task. (B) A nonlinearly separable
classification task. The goal of a classifier is to partition the space into regions and associate each region with a
class.

Let us now make what we have said so far more concrete. We are given a set of training patterns,
$\boldsymbol{x}_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, that belong to either of two classes, say, $\omega_1$ and $\omega_2$. The goal is to design a
hyperplane

$$f(\boldsymbol{x}) = \theta_0 + \theta_1 x_1 + \cdots + \theta_l x_l = \boldsymbol{\theta}^T \boldsymbol{x} = 0,$$

where we have absorbed the bias $\theta_0$ in $\boldsymbol{\theta}$ and we have extended the dimension of $\boldsymbol{x}$, as explained
before. Our aim is to place this hyperplane in between the two classes. Obviously, any point lying on
this hyperplane scores a zero, $f(\boldsymbol{x}) = 0$, and the points lying on either side of the hyperplane score
either a positive ($f(\boldsymbol{x}) > 0$) or a negative value ($f(\boldsymbol{x}) < 0$), depending on which side of the hyperplane
they lie. We should therefore train our classifier so that the points from one class score a positive value
and the points of the other a negative one. This can be done, for example, by labeling all points from
class $\omega_1$ with $y_n = 1$, $\forall n: \boldsymbol{x}_n \in \omega_1$, and all points from class $\omega_2$ with $y_n = -1$, $\forall n: \boldsymbol{x}_n \in \omega_2$. Then the
squared error is mobilized to compute $\boldsymbol{\theta}$ so as to minimize the cost, i.e.,

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \left(y_n - \boldsymbol{\theta}^T \boldsymbol{x}_n\right)^2.$$

The solution is exactly the same as Eq. (3.13). Fig. 3.6 shows the resulting LS classifiers for two cases
of data. Observe that in the case of Fig. 3.6B, the resulting classifier cannot classify correctly all the
data points. Our desire to place all data which originate from one class on one side and the rest on the
other cannot be satisfied. All that the LS classifier can do is to place the hyperplane so that the sum
of squared errors between the desired (true) values of the labels $y_n$ and the predicted outputs $\boldsymbol{\theta}^T \boldsymbol{x}_n$ is
minimal. It is mainly for cases such as overlapping classes, which are usually encountered in practice,
where one has to look for an alternative to the squared error criteria and methods, in order to serve better
the needs and the goals of the classification task. For example, a reasonable optimality criterion would
be to minimize the probability of error, that is, the percentage of points for which the true labels $y_n$ and
the ones predicted by the classifier, $\hat{y}_n$, are different. Chapter 7 presents methods and loss functions
appropriate for the classification task. In Chapter 11 support vector machines are discussed and in
Chapter 18 neural networks and deep learning methods are presented. These are currently among the
most powerful techniques for classification problems.
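
The LS classifier described above can be sketched in a few lines. The two Gaussian clouds, their means and spreads, and the sample sizes are hypothetical choices that make the classes well separated; they are not taken from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100  # points per class (illustrative)

# Two hypothetical Gaussian clouds in the plane, one per class
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(N, 2))    # class omega_1, label +1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(N, 2))  # class omega_2, label -1
X = np.column_stack([np.vstack([X1, X2]), np.ones(2 * N)]) # extended feature vectors
y = np.concatenate([np.ones(N), -np.ones(N)])

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # same normal equations as in regression
y_pred = np.sign(X @ theta_hat)                # side of the hyperplane theta^T x = 0
error_rate = np.mean(y_pred != y)
print(theta_hat, error_rate)
```

Moving the two means closer together makes the classes overlap, and the training error rate then becomes nonzero, in line with the discussion of nonseparable classes.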

FIGURE 3.6

Design of a linear classifier, $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$, based on the squared error loss function. (A) The case of two
linearly separable classes. (B) The case of nonseparable classes. In the latter case, the classifier cannot separate fully
the two classes. All it can do is to place the separating (decision) line so as to minimize the deviation between the
true labels and the predicted output values in the LS sense.

GENERATIVE VERSUS DISCRIMINATIVE LEARNING

The path that we have taken in order to introduce the classification task was to consider a functional
dependence between the output variable (label) $y$ and the input variables (features) $\boldsymbol{x}$. The involved
parameters were optimized with respect to a cost function. This path of modeling is also known as
discriminative learning. We were not concerned with the statistical nature of the dependence that ties
these two sets of variables together. In a more general setting, the term discriminative learning is also
used to cover methods that model directly the posterior probability of a class, represented by its label
$y$, given the feature vector $\boldsymbol{x}$, as in $P(y|\boldsymbol{x})$. The common characteristic of all these methods is that
they bypass the need of modeling the input data distribution explicitly. From a statistical point of view,
discriminative learning is justified as follows.

Using the product rule for probabilities, the joint distribution between the input data and their
respective labels can be written as

$$p(y, \boldsymbol{x}) = P(y|\boldsymbol{x})\, p(\boldsymbol{x}).$$

In discriminative learning, only the first of the two terms in the product is considered; a functional form
is adopted and parameterized appropriately as $P(y|\boldsymbol{x}; \boldsymbol{\theta})$. Parameters are then estimated by optimizing
a cost function. The distribution of the input data is ignored. Such an approach has the advantage that
simpler models can be used, especially if the input data are described by a joint probability density
function (PDF) of a complex form. The disadvantage is that the input data distribution is ignored,
although it can carry important information, which could be exploited to the benefit of the overall
performance.

In contrast, the alternative path, known as generative learning, exploits the input data distribution.
Once more, employing the product rule, we have

$$p(y, \boldsymbol{x}) = p(\boldsymbol{x}|y)\, P(y),$$

where $P(y)$ is the probability concerning the classes and $p(\boldsymbol{x}|y)$ is the distribution of the input given
the class label. For such an approach, we end up with one distribution per class, which has to be
learned. In parametric modeling, a set of parameters is associated with each one of these conditional
distributions. Once the input-output joint distribution has been learned, the prediction of the class label
of an unknown pattern, $\boldsymbol{x}$, is performed based on the a posteriori probability, i.e.,

$$P(y|\boldsymbol{x}) = \frac{p(y, \boldsymbol{x})}{p(\boldsymbol{x})} = \frac{p(y, \boldsymbol{x})}{\sum_{y} p(y, \boldsymbol{x})}.$$

We will return to these issues in more detail in Chapter 7. 
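
To make the generative path concrete, the sketch below learns one Gaussian per class for a scalar feature, together with the class probabilities, and then forms the posterior from the joint distribution. The Gaussian form of $p(x|y)$, the sample sizes, and all names are assumptions of this illustration only.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(0)
# Hypothetical training data: scalar feature, two classes with labels +1 and -1
classes = {+1: rng.normal(1.0, 1.0, size=300), -1: rng.normal(-1.0, 1.0, size=100)}
n_total = sum(len(v) for v in classes.values())

# Learn one distribution per class, p(x|y), and the class probabilities P(y)
params = {c: (v.mean(), v.var(), len(v) / n_total) for c, v in classes.items()}

def posterior(x):
    """Return P(y|x) for each class via p(y, x) / sum_y p(y, x)."""
    joint = {c: gaussian_pdf(x, m, s2) * prior for c, (m, s2, prior) in params.items()}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

print(posterior(0.0))
```

At $x = 0$ the two class-conditional densities are roughly equal, so the posterior is driven mainly by the class probabilities, which is information a purely discriminative model would have to absorb into its parameters.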


3.5 BIASED VERSUS UNBIASED ESTIMATION 

In supervised learning, we are given a set of training points, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, and we return an
estimate of the unknown parameter vector, say, $\hat{\boldsymbol{\theta}}$. However, the training points themselves are random
variables. If we are given another set of $N$ observations of the same random variables, these are going
to be different, and obviously the resulting estimate will also be different. In other words, by changing
our training data, different estimates will result. Hence, we can assume that the resulting estimate, of
a fixed yet unknown parameter, is itself a random variable. This, in turn, poses questions on how good 
an estimator is. Undoubtedly, each time the obtained estimate is optimal with respect to the adopted 
loss function and the specific training set used. However, who guarantees that the resulting estimates 
are “close” to the true value, assuming that there is one? In this section, we will try to address this task 
and to illuminate some related theoretical aspects. Note that we have already used the term estimator 
in place of the term estimate. Let us elaborate a bit on their difference, before presenting more details. 

An estimate, such as $\hat{\boldsymbol{\theta}}$, has a specific value, which is the result of a function acting on a set of
observations on which our chosen estimate depends (see Eq. (3.17)). In general, we could generalize
Eq. (3.17) and write

$$\hat{\boldsymbol{\theta}} = g(\boldsymbol{y}, X).$$

However, once we allow the set of observations to change randomly, and the estimate becomes itself a
random variable, we write the previous equation in terms of the corresponding random variables, i.e.,

$$\hat{\boldsymbol{\theta}} = g(\mathbf{y}, \mathbf{X}),$$

where $\mathbf{y}$ and $\mathbf{X}$ now denote the random counterparts of the observations, and we refer to this functional
dependence as the estimator of the unknown vector $\boldsymbol{\theta}$.

In order to simplify the analysis and focus on the insight behind the methods, we will assume that
our parameter space is that of real numbers, $\mathbb{R}$. We will also assume that the model (i.e., the set of
functions $\mathcal{F}$) which we have adopted for modeling our data is the correct one and that the (unknown
to us) value of the associated true parameter is equal to $\theta_o$.⁴ Let $\hat{\theta}$ denote the random variable of the
associated estimator. Adopting the squared error loss function to quantify deviations, a reasonable
criterion to measure the performance of an estimator is the mean-square error (MSE),

$$\mathrm{MSE} = E\left[(\hat{\theta} - \theta_o)^2\right], \tag{3.21}$$

where the mean $E$ is taken over all possible training data sets of size $N$. If the MSE is small, then we
expect that, on average, the resulting estimates are close to the true value. Note that although $\theta_o$ is not
known, studying the way in which the MSE depends on various terms will still help us to learn how to
proceed in practice and unravel possible paths that one can follow to obtain good estimators.

⁴ Not to be confused with the intercept; the subscript here is “o” and not “0.”

Indeed, this simple and “natural” looking criterion hides some interesting surprises for us. Let us
insert the mean value $E[\hat{\theta}]$ of $\hat{\theta}$ in Eq. (3.21). We get

$$\mathrm{MSE} = E\left[\Big(\big(\hat{\theta} - E[\hat{\theta}]\big) + \big(E[\hat{\theta}] - \theta_o\big)\Big)^2\right] = \underbrace{E\left[\big(\hat{\theta} - E[\hat{\theta}]\big)^2\right]}_{\text{Variance}} + \underbrace{\big(E[\hat{\theta}] - \theta_o\big)^2}_{\text{Bias}^2}, \tag{3.22}$$

where, for the second equality, we have taken into account that the mean value of the product of the
two involved terms turns out to be zero: the factor $E[\hat{\theta}] - \theta_o$ is a constant and
$E\big[\hat{\theta} - E[\hat{\theta}]\big] = 0$. What Eq. (3.22) suggests is that the
MSE consists of two terms. The first one is the variance around the mean value and the second one is
due to the bias, that is, the deviation of the mean value of the estimator from the true one.
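
The decomposition in Eq. (3.22) can be checked by simulation. The sketch below assumes Gaussian data and uses a deliberately biased estimator, a sample mean scaled by 0.8; the true value, the noise level, and the number of trials are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_o, sigma, N = 2.0, 1.0, 10   # true parameter, noise std, training set size (illustrative)
trials = 200_000

# Each row is one training set; the estimator is a shrunk (hence biased) sample mean
data = rng.normal(theta_o, sigma, size=(trials, N))
estimates = 0.8 * data.mean(axis=1)

mse = np.mean((estimates - theta_o) ** 2)
variance = np.var(estimates)
bias_sq = (np.mean(estimates) - theta_o) ** 2
print(mse, variance + bias_sq)
```

For sample moments the two printed numbers agree up to rounding, because the cross term vanishes for the empirical averages exactly as it does for the expectations. Under these assumptions the theoretical values are a variance of $0.64 \cdot \sigma^2/N = 0.064$ and a squared bias of $(0.2\,\theta_o)^2 = 0.16$.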


3.5.1 BIASED OR UNBIASED ESTIMATION? 

One may naively think that choosing an estimator that is unbiased, that is, $E[\hat{\theta}] = \theta_o$, such that the second
term in Eq. (3.22) becomes zero, is a reasonable choice. Adopting an unbiased estimator may also
be appealing from the following point of view. Assume that we have $L$ different training sets, each
comprising $N$ points. Let us denote each data set by $\mathcal{D}_i$, $i = 1, 2, \ldots, L$. For each one, an estimate $\hat{\theta}_i$,
$i = 1, 2, \ldots, L$, will result. Then, we form a new estimator by taking the average value,

$$\hat{\theta}^{(L)} := \frac{1}{L} \sum_{i=1}^{L} \hat{\theta}_i.$$

This is also an unbiased estimator, because

$$E[\hat{\theta}^{(L)}] = \frac{1}{L} \sum_{i=1}^{L} E[\hat{\theta}_i] = \theta_o.$$

Moreover, assuming that the involved estimators are mutually uncorrelated, i.e.,

$$E\left[(\hat{\theta}_i - \theta_o)(\hat{\theta}_j - \theta_o)\right] = 0, \quad i \neq j,$$

and of the same variance, $\sigma^2$, the variance of the new estimator is now much smaller (Problem 3.2),
i.e.,

$$\sigma^2_{\hat{\theta}^{(L)}} = E\left[\big(\hat{\theta}^{(L)} - \theta_o\big)^2\right] = \frac{\sigma^2}{L}.$$

Hence, by averaging a large number of such unbiased estimators, we expect to get an estimate close
to the true value. However, in practice, data are a commodity that is not always abundant. As a matter
of fact, very often the opposite is true and one has to be very careful about how to exploit them. In
such cases, where one cannot afford to obtain and average a large number of estimators, an unbiased
estimator may not necessarily be the best choice. Going back to Eq. (3.22), there is no reason to suggest
that by making the second term equal to zero, the MSE (which, after all, is the quantity of interest to
us) becomes minimal. Indeed, let us look at Eq. (3.22) from a slightly different point of view. Instead of
computing the MSE for a given estimator, let us treat the estimator $\hat{\theta}$ in Eq. (3.22) as a free variable and
compute an estimator that will minimize the MSE with respect to $\hat{\theta}$ directly. In this case, focusing on
unbiased estimators, or $E[\hat{\theta}] = \theta_o$, introduces a constraint to the task of minimizing the MSE, and it is
well known that an unconstrained minimization problem always results in loss function values that are
less than or equal to any value generated by a constrained counterpart, i.e.,

$$\min_{\hat{\theta}} \mathrm{MSE}(\hat{\theta}) \leq \min_{\hat{\theta}:\, E[\hat{\theta}] = \theta_o} \mathrm{MSE}(\hat{\theta}), \tag{3.23}$$

where the dependence of the MSE on the estimator $\hat{\theta}$ in Eq. (3.23) is explicitly denoted.

Let us denote by $\hat{\theta}_{\mathrm{MVU}}$ a solution of the task $\min_{\hat{\theta}:\, E[\hat{\theta}] = \theta_o} \mathrm{MSE}(\hat{\theta})$. It can be readily verified by
Eq. (3.22) that $\hat{\theta}_{\mathrm{MVU}}$ is an unbiased estimator of minimum variance. Such an estimator is known as
the minimum variance unbiased (MVU) estimator and we assume that such an estimator exists. An
MVU does not always exist (see [20], Problem 3.3). Moreover, if it exists it is unique (Problem 3.4).
Motivated by Eq. (3.23), our next goal is to search for a biased estimator, which results, hopefully, in
a smaller MSE. Let us denote this estimator as $\hat{\theta}_b$. For the sake of illustration, and in order to limit our
search for $\hat{\theta}_b$, we consider here only $\hat{\theta}_b$s that are scalar multiples of $\hat{\theta}_{\mathrm{MVU}}$, so that

$$\hat{\theta}_b = (1 + \alpha)\hat{\theta}_{\mathrm{MVU}}, \tag{3.24}$$

where $\alpha \in \mathbb{R}$ is a free parameter. Note that $E[\hat{\theta}_b] = (1 + \alpha)\theta_o$. By substituting Eq. (3.24) into Eq. (3.22)
and after some simple algebra we obtain

$$\mathrm{MSE}(\hat{\theta}_b) = (1 + \alpha)^2\, \mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}}) + \alpha^2 \theta_o^2. \tag{3.25}$$

In order to get $\mathrm{MSE}(\hat{\theta}_b) < \mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}})$, $\alpha$ must be in the range (Problem 3.5) of

$$-\frac{2\,\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}})}{\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}}) + \theta_o^2} < \alpha < 0. \tag{3.26}$$

It is easy to verify that this range implies that $|1 + \alpha| < 1$. Hence, $|\hat{\theta}_b| = |(1 + \alpha)\hat{\theta}_{\mathrm{MVU}}| < |\hat{\theta}_{\mathrm{MVU}}|$. We
can go a step further and try to compute the optimum value of $\alpha$, which corresponds to the minimum
MSE. By taking the derivative of $\mathrm{MSE}(\hat{\theta}_b)$ in Eq. (3.25) with respect to $\alpha$, it turns out (Problem 3.6)
that this occurs for

$$\alpha_* = -\frac{\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}})}{\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}}) + \theta_o^2} = -\frac{1}{1 + \dfrac{\theta_o^2}{\mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}})}}. \tag{3.27}$$

Therefore, we have found a way to obtain the optimum estimator among those in the set
$\{\hat{\theta}_b = (1 + \alpha)\hat{\theta}_{\mathrm{MVU}} : \alpha \in \mathbb{R}\}$, which results in the minimum MSE. This is true, but as many nice things in life, this
is not, in general, realizable. The optimal value for $\alpha$ is given in terms of the unknown $\theta_o$! However,
Eq. (3.27) is useful in a number of other ways. First, there are cases where the MSE is proportional to
$\theta_o^2$; hence, this formula can be used. Also, for certain cases, it can be used to provide useful bounds
[19]. Moreover, as far as we are concerned in this book, it says something very important. If we want
to do better than the MVU, then, looking at the text following Eq. (3.26), a possible way is to shrink
the norm of the MVU estimator. Shrinking the norm is a way of introducing bias into an estimator. We
will discuss ways to achieve this in Section 3.8 and later on in Chapters 6 and 9.
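
A simulation can illustrate the effect of shrinking. The sketch below assumes Gaussian observations with known variance, for which the sample mean is the MVU estimator of the mean with MSE equal to $\sigma^2/N$; the values of $\theta_o$, $\sigma$, and $N$ are illustrative, and $\theta_o$ is used to compute $\alpha_*$ only because this is a simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_o, sigma, N = 1.0, 2.0, 5    # illustrative values; theta_o is unknown in practice
trials = 200_000

mse_mvu = sigma**2 / N                          # MSE (= variance) of the sample mean
alpha_star = -mse_mvu / (mse_mvu + theta_o**2)  # optimal shrinkage, Eq. (3.27)

data = rng.normal(theta_o, sigma, size=(trials, N))
theta_mvu = data.mean(axis=1)                   # unbiased sample mean
theta_b = (1.0 + alpha_star) * theta_mvu        # shrunk, biased estimator of Eq. (3.24)

emp_mvu = np.mean((theta_mvu - theta_o) ** 2)   # empirical MSE of the unbiased estimator
emp_b = np.mean((theta_b - theta_o) ** 2)       # empirical MSE of the shrunk estimator
predicted_b = (1 + alpha_star) ** 2 * mse_mvu + alpha_star**2 * theta_o**2  # Eq. (3.25)
print(emp_mvu, emp_b, predicted_b)
```

The shrunk estimator trades a nonzero bias for a larger reduction in variance, so its empirical MSE falls below that of the unbiased one and matches the value predicted by Eq. (3.25) up to simulation error.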

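Note that what we have said so far is readily generalized to parameter vectors. An unbiased
parameter vector estimator satisfies

$$E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}_o,$$

and the MSE around the true value $\boldsymbol{\theta}_o$ is defined as

$$\mathrm{MSE} = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)^T (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)\right] = \sum_{i=1}^{l} E\left[(\hat{\theta}_i - \theta_{oi})^2\right].$$

Looking carefully at the previous definition reveals that the MSE for a parameter vector is the sum of
the MSEs of the components $\hat{\theta}_i$, $i = 1, 2, \ldots, l$, around the corresponding true values $\theta_{oi}$.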


3.6 THE CRAMÉR-RAO LOWER BOUND

In the previous sections, we saw how one can improve the performance of the MVU estimator, provided 
that this exists and it is known. However, how can one know that an unbiased estimator that has been 
obtained is also of minimum variance? The goal of this section is to introduce a criterion that can 
provide such information. 

The Cramer-Rao lower bound [9,3 1 ] is an elegant theorem and one of the most well-known tech- 
niques used in statistics. It provides a lower bound on the variance of any unbiased estimator. This is 
very important because (a) it offers the means to assert whether an unbiased estimator has minimum 
variance, which, of course, in this case coincides with the corresponding MSE in Eq. (3.22), (b) if this 
is not the case, it can be used to indicate how far away the performance of an unbiased estimator is 
from the optimal one, and finally (c) it provides the designer with a tool to know the best possible 
performance that can be achieved by an unbiased estimator. Because our main purpose here is to focus 
on the insight and physical interpretation of the method, we will deal with the simple case where our 
unknown parameter is a real number. The general form of the theorem, involving vectors, is given in 
Appendix B. 

We are looking for a bound of the variance of an unbiased estimator, whose randomness is due to
the randomness of the training data, as we change from one set to another. Thus, it does not come as
a surprise that the bound involves the joint PDF of the data, parameterized in terms of the unknown
parameter $\theta$. Let $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ denote the set of $N$ observations, corresponding to a random
vector$^5$ $\mathbf{x}$ that depends on the unknown parameter. Also, let the respective joint PDF of the observations
be denoted as $p(\mathcal{X}; \theta)$.

Theorem 3.1. It is assumed that the joint PDF satisfies the following regularity condition:

$$\mathbb{E}\left[\frac{\partial \ln p(\mathcal{X}; \theta)}{\partial \theta}\right] = 0, \quad \forall \theta. \tag{3.28}$$

This regularity condition is a weak one and holds for most of the cases in practice (Problem 3.7). Then,
the variance of any unbiased estimator, $\hat{\theta}$, must satisfy the following inequality:

$$\sigma_{\hat{\theta}}^2 \geq \frac{1}{I(\theta)}: \quad \text{Cramer-Rao lower bound}, \tag{3.29}$$

where

$$I(\theta) := -\mathbb{E}\left[\frac{\partial^2 \ln p(\mathcal{X}; \theta)}{\partial \theta^2}\right]. \tag{3.30}$$

Moreover, the necessary and sufficient condition for obtaining an unbiased estimator that achieves the
bound is the existence of a function $g(\cdot)$ such that for all possible values of $\theta$,

$$\frac{\partial \ln p(\mathcal{X}; \theta)}{\partial \theta} = I(\theta)\left(g(\mathcal{X}) - \theta\right). \tag{3.31}$$

The MVU estimate is then given by

$$\hat{\theta} = g(\mathcal{X}) := g(x_1, x_2, \ldots, x_N), \tag{3.32}$$

and the variance of the respective estimator is equal to $1/I(\theta)$.

When an MVU estimator attains the Cramer-Rao bound, we say that it is efficient. All the expectations
before are taken with respect to $p(\mathcal{X}; \theta)$. The interested reader may find more on the topic in
more specialized books on statistics [20,27,36].

$^5$ Note that here $\mathbf{x}$ is treated as a random quantity in a general setting, and not necessarily in the context of the regression/classification tasks.

Example 3.2. Let us consider the simplified version of the linear regression model in Eq. (3.10), where
the regressor is real-valued and the bias term is zero,

$$y_n = \theta x + \eta_n, \tag{3.33}$$

where we have explicitly denoted the dependence on $n$, which runs over the number of available
observations. Note that in order to further simplify the discussion, we have assumed that our $N$ observations
are the result of different realizations of the noise variable only, and that we have kept the value of the
input, $x$, constant, which can be considered to be equal to one, without harming generality; that is, our
task is reduced to that of estimating a parameter from its noisy measurements. Thus, for this case, the
observations are the scalar outputs $y_n$, $n = 1, 2, \ldots, N$, which we consider to be the components of a
vector $\boldsymbol{y} \in \mathbb{R}^N$. We further assume that $\eta_n$ are samples of a white Gaussian noise (Section 2.4) with
zero mean and variance equal to $\sigma_\eta^2$; that is, successive samples are i.i.d. drawn and, hence, they are
mutually uncorrelated ($\mathbb{E}[\eta_i \eta_j] = 0$, $i \neq j$). Then, the joint PDF of the output observations is given
by



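$$p(\boldsymbol{y}; \theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma_\eta^2}} \exp\left(-\frac{(y_n - \theta)^2}{2\sigma_\eta^2}\right), \tag{3.34}$$

or

$$p(\boldsymbol{y}; \theta) = \frac{1}{(2\pi\sigma_\eta^2)^{N/2}} \exp\left(-\frac{1}{2\sigma_\eta^2}\sum_{n=1}^{N}(y_n - \theta)^2\right). \tag{3.35}$$

We will derive the corresponding Cramer-Rao bound. Taking the derivative of the logarithm with
respect to $\theta$ we have

$$\frac{\partial \ln p(\boldsymbol{y}; \theta)}{\partial \theta} = \frac{1}{\sigma_\eta^2}\sum_{n=1}^{N}(y_n - \theta) = \frac{N}{\sigma_\eta^2}(\bar{y} - \theta), \tag{3.36}$$

where

$$\bar{y} := \frac{1}{N}\sum_{n=1}^{N} y_n,$$

that is, the sample mean of the observations. The second derivative, as required by the theorem, is
given by

$$\frac{\partial^2 \ln p(\boldsymbol{y}; \theta)}{\partial \theta^2} = -\frac{N}{\sigma_\eta^2},$$

and hence,

$$I(\theta) = \frac{N}{\sigma_\eta^2}. \tag{3.37}$$

Eq. (3.36) is in the form of Eq. (3.31), with $g(\boldsymbol{y}) = \bar{y}$; thus, an efficient estimator can be obtained
and the lower bound of the variance of any unbiased estimator, for our data model of Eq. (3.33),
is

$$\sigma_{\hat{\theta}}^2 \geq \frac{\sigma_\eta^2}{N}. \tag{3.38}$$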

We can easily verify that the corresponding estimator $\bar{y}$ is indeed an unbiased one under the adopted
model of Eq. (3.33),

$$\mathbb{E}[\bar{y}] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[y_n] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[\theta + \eta_n] = \theta.$$

Moreover, the previous formula, combined with Eq. (3.36), also establishes the regularity condition, as
required by the Cramer-Rao theorem.
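
The two properties used above can also be checked numerically. The following is a minimal sketch, assuming the scalar model of Eq. (3.33) with $x = 1$; the parameter values, the function names, and the number of trials are illustrative choices of ours and do not come from the text. It estimates the mean of the derivative in Eq. (3.36), which by Eq. (3.28) should be close to zero, and its mean square, which by a standard identity (not derived here) equals $I(\theta)$ under the regularity condition.

```python
import random


def score(y, theta, noise_var):
    """Derivative of ln p(y; theta) with respect to theta, as in Eq. (3.36)."""
    return sum(y_n - theta for y_n in y) / noise_var


def simulate_scores(theta, noise_var, n_obs, n_trials, seed=0):
    """Draw n_trials data sets from y_n = theta + eta_n and return their scores."""
    rng = random.Random(seed)
    noise_std = noise_var ** 0.5
    scores = []
    for _ in range(n_trials):
        y = [theta + rng.gauss(0.0, noise_std) for _ in range(n_obs)]
        scores.append(score(y, theta, noise_var))
    return scores


# Illustrative values (assumptions, not taken from the text).
theta, noise_var, n_obs = 2.0, 0.5, 20
scores = simulate_scores(theta, noise_var, n_obs, n_trials=20000)

mean_score = sum(scores) / len(scores)                    # regularity condition, Eq. (3.28)
mean_sq_score = sum(s * s for s in scores) / len(scores)  # empirical Fisher information
fisher_info = n_obs / noise_var                           # closed form, Eq. (3.37)

print(f"mean score        : {mean_score:.4f}")
print(f"mean squared score: {mean_sq_score:.3f} (N / noise variance = {fisher_info:.3f})")
```

The sample averages only approximate the expectations, so small deviations from 0 and from $N/\sigma_\eta^2$ are to be expected.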

The bound in Eq. (3.38) is a very natural result. The Cramer-Rao lower bound depends on the
variance of the noise source. The higher this is, and therefore the higher the uncertainty of each ob- 
servation with respect to the value of the true parameter is, the higher the minimum variance of an 
estimator is expected to be. On the other hand, as the number of observations increases and more “in- 
formation” is disclosed to us, the uncertainty decreases and we expect the variance of our estimator to 
decrease. 

Having obtained the lower bound for our task, let us return our attention to the LS estimator for
the specific regression model of Eq. (3.33). This results from Eq. (3.13) by setting $x_n = 1$, and a simple
inspection shows that the LS estimate is nothing but the sample mean, $\bar{y}$, of the observations.
Furthermore, the variance of the corresponding estimator is given by

$$\sigma_{\bar{y}}^2 = \mathbb{E}\left[(\bar{y} - \theta)^2\right] = \frac{1}{N^2}\,\mathbb{E}\left[\Big(\sum_{n=1}^{N}\eta_n\Big)^2\right] = \frac{\sigma_\eta^2}{N},$$

which coincides with our previous finding via the use of the Cramer-Rao theorem. In other words, for
this particular task and having assumed that the noise is white Gaussian, the LS estimator $\bar{y}$ is an MVU
estimator and it attains the Cramer-Rao bound. However, if the input is not fixed, but it also varies
from experiment to experiment and the training data become $(y_n, x_n)$, then the LS estimator attains the
Cramer-Rao bound only asymptotically, for large values of $N$ (Problem 3.8). Moreover, it has to be
pointed out that if the assumptions for the noise being Gaussian and white are not valid, then the LS
estimator is not efficient anymore.
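
As a rough numerical illustration of efficiency, the sketch below compares two unbiased estimators of $\theta$ under the white Gaussian model of Eq. (3.33) with $x_n = 1$: the sample mean (the LS estimate) and the sample median. The median is our own choice of a competing estimator and is not discussed in the text; for Gaussian noise its variance is known to approach $\pi\sigma_\eta^2/(2N)$ for large $N$, so it should stay above the bound of Eq. (3.38). All numerical values and function names are assumptions made for the example.

```python
import random
import statistics


def simulate_estimators(theta, noise_var, n_obs, n_trials, seed=1):
    """Return the sample means and sample medians of n_trials independent data sets."""
    rng = random.Random(seed)
    noise_std = noise_var ** 0.5
    means, medians = [], []
    for _ in range(n_trials):
        y = [theta + rng.gauss(0.0, noise_std) for _ in range(n_obs)]
        means.append(sum(y) / n_obs)            # LS estimate: the sample mean
        medians.append(statistics.median(y))    # alternative unbiased estimate
    return means, medians


def mean_square_around(values, center):
    """Average squared deviation from center (the variance, for an unbiased estimator)."""
    return sum((v - center) ** 2 for v in values) / len(values)


# Illustrative values (assumptions, not taken from the text).
theta, noise_var, n_obs = 1.0, 0.1, 25
means, medians = simulate_estimators(theta, noise_var, n_obs, n_trials=20000)

cr_bound = noise_var / n_obs                     # Eq. (3.38)
var_mean = mean_square_around(means, theta)
var_median = mean_square_around(medians, theta)

print(f"Cramer-Rao bound    : {cr_bound:.5f}")
print(f"variance of the mean: {var_mean:.5f}")
print(f"variance of median  : {var_median:.5f}")
```

Up to Monte Carlo error, the sample mean sits on the bound while the median does not, which is what one expects from an efficient and a non-efficient estimator, respectively.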

It turns out that this result, which has been obtained for the real axis case, is also true for the general
regression model given in Eq. (3.10) (Problem 3.9). We will return to the properties of the LS estimator 
in more detail in Chapter 6. 

Remarks 3.2.

• The Cramer-Rao bound is not the only one that is available in the literature. For example, the
  Bhattacharyya bound makes use of higher-order derivatives of the PDF. It turns out that in cases where an
  efficient estimator does not exist, the Bhattacharyya bound is tighter compared to the Cramer-Rao
  one with respect to the variance of the MVU estimator [27]. Other bounds also exist [21]; however,
  the Cramer-Rao bound is the easiest to determine.


3.7 SUFFICIENT STATISTIC 

If an efficient estimator does not exist, this does not necessarily mean that the MVU estimator cannot
be determined. It may exist, but it will not be an efficient one, in the sense that it does not attain the
Cramer-Rao bound. In such cases, the notion of sufficient statistic and the Rao-Blackwell theorem
come into the picture.$^6$ Note that such techniques are beyond the focus of this book and they are
mentioned here in order to provide a more complete picture of the topic. In our context, concerning
the needs of the book, we will refer to and use the sufficient statistic notion when dealing with the
exponential family of distributions in Chapter 12. The section can be bypassed in a first reading.

The notion of sufficient statistic is due to Sir Ronald Aylmer Fisher (1890-1962). Fisher was an 
English statistician and biologist who made a number of fundamental contributions that laid out many 
of the foundations of modern statistics. Besides statistics, he made important contributions in genetics. 

In short, given a random vector $\mathbf{x}$, whose distribution depends on a parameter $\theta$, a sufficient statistic
for the unknown parameter is a function,

$$T(\mathcal{X}) := T(x_1, x_2, \ldots, x_N),$$

of the respective observations, which contains all information about $\theta$. From a mathematical point of
view, a statistic $T(\mathcal{X})$ is said to be sufficient for the parameter $\theta$ if the conditional joint PDF

$$p(\mathcal{X} \mid T(\mathcal{X}); \theta)$$

does not depend on $\theta$. In such a case, it becomes apparent that $T(\mathcal{X})$ must provide all information
about $\theta$, which is contained in the set $\mathcal{X}$. Once $T(\mathcal{X})$ is known, $\mathcal{X}$ is no longer needed, because no
further information can be extracted from it; this justifies the name of “sufficient statistic.” The concept
of sufficient statistic is also generalized to parameter vectors $\boldsymbol{\theta}$. In such a case, the sufficient statistic
may be a set of functions, called a jointly sufficient statistic. Typically, there are as many functions as
there are parameters; in a slight abuse of notation, we will still write $T(\mathcal{X})$ to denote this set (vector
of) functions.

A very important theorem, which facilitates the search for a sufficient statistic in practice, is the 
following [27]. 

Theorem 3.2 (Factorization theorem). A statistic $T(\mathcal{X})$ is sufficient if and only if the respective joint
PDF can be factored as

$$p(\mathcal{X}; \theta) = h(\mathcal{X})\, g(T(\mathcal{X}), \theta).$$

That is, the joint PDF is factored into two parts: one part that depends only on the statistic and the
parameters and a second part that is independent of the parameters. The theorem is also known as the
Fisher-Neyman factorization theorem.


$^6$ It must be pointed out that the use of the sufficient statistic in statistics extends much beyond the search for MVU estimators.

Once a sufficient statistic has been found and under certain conditions related to the statistic, the
Rao-Blackwell theorem determines the MVU estimator by taking the expectation conditioned on 
T(X). A by-product of this theorem is that if an unbiased estimator is expressed solely in terms of 
the sufficient statistic, then it is necessarily the unique MVU estimator [23]. The interested reader can 
obtain more on these issues from [20,21,27]. 

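Example 3.3. Let $\mathrm{x}$ be a Gaussian random variable, $\mathcal{N}(\mu, \sigma^2)$, and let the set of i.i.d. observations be
$\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$. Assume the mean value, $\mu$, to be the unknown parameter. Show that

$$S_\mu = \frac{1}{N}\sum_{n=1}^{N} x_n$$

is a sufficient statistic for the parameter $\mu$.
The joint PDF is given by

$$p(\mathcal{X}; \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right).$$

Plugging the obvious identity

$$\sum_{n=1}^{N}(x_n - \mu)^2 = \sum_{n=1}^{N}(x_n - S_\mu)^2 + N(S_\mu - \mu)^2$$

into the joint PDF, we obtain

$$p(\mathcal{X}; \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - S_\mu)^2\right) \exp\left(-\frac{N}{2\sigma^2}(S_\mu - \mu)^2\right),$$

which, according to the factorization theorem, proves the claim.

In a similar way, one can prove (Problem 3.10) that if the unknown parameter is the variance
$\sigma^2$, then $\bar{S}_{\sigma^2} := \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2$ is a sufficient statistic, and if both $\mu$ and $\sigma^2$ are unknown, then a
sufficient statistic is the set $(S_\mu, S_{\sigma^2})$, where

$$S_{\sigma^2} = \frac{1}{N}\sum_{n=1}^{N}(x_n - S_\mu)^2.$$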

In other words, in this case, all information concerning the unknown set of parameters that can be 
possibly extracted from the available N observations can be fully recovered by considering only the 
sum of the observations and the sum of their squares. 
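
A small numerical check can make the factorization more tangible. In the sketch below (data values, variance, and function names are illustrative assumptions of ours), two data sets with the same sample mean $S_\mu$ but very different spread are compared. By the factorization of the Gaussian joint PDF with known variance, the log-likelihood difference between two candidate values of $\mu$ equals $-\frac{N}{2\sigma^2}\left[(S_\mu - \mu_a)^2 - (S_\mu - \mu_b)^2\right]$, so it should be identical for both data sets.

```python
import math


def log_likelihood(x, mu, var):
    """ln p(X; mu) for i.i.d. Gaussian observations with known variance var."""
    n = len(x)
    return (-0.5 * n * math.log(2.0 * math.pi * var)
            - sum((x_n - mu) ** 2 for x_n in x) / (2.0 * var))


def log_likelihood_ratio(x, mu_a, mu_b, var):
    """ln p(X; mu_a) - ln p(X; mu_b); the factor h(X) cancels in this difference."""
    return log_likelihood(x, mu_a, var) - log_likelihood(x, mu_b, var)


var = 2.0                        # known variance (illustrative)
x1 = [0.0, 1.0, 2.0, 5.0]        # sample mean 2.0, large spread
x2 = [2.0, 2.0, 2.0, 2.0]        # same sample mean, no spread

ratio_1 = log_likelihood_ratio(x1, mu_a=0.5, mu_b=3.0, var=var)
ratio_2 = log_likelihood_ratio(x2, mu_a=0.5, mu_b=3.0, var=var)

print(ratio_1, ratio_2)          # both depend on the data only through the sample mean
```

The individual log-likelihoods of the two data sets differ, but every comparison between candidate values of $\mu$ gives the same answer, which is the practical meaning of sufficiency here.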


3.8 REGULARIZATION

We have already seen that the LS estimator is an MVU estimator, under the assumptions of linearity 
of the regression model and in the presence of a Gaussian white noise source. We also know that one 
can improve the performance by shrinking the norm of the MVU estimator. There are various ways 
to achieve this goal, and they will be discussed later on in this book. In this section, we focus on one 
possibility. Moreover, we will see that trying to keep the norm of the solution small serves important 
needs in the context of machine learning. 

Regularization is a mathematical tool to impose a priori information on the structure of the solution,
which comes as the outcome of an optimization task. Regularization was first suggested by the
great Russian mathematician Andrey Nikolayevich Tychonoff (sometimes spelled Tikhonov) for the 
solution of integral equations. Sometimes it is also referred to as Tychonoff-Phillips regularization, to
honor David L. Phillips as well, who developed the method independently [29,39].

In the context of our task and in order to shrink the norm of the parameter vector estimate, we can
reformulate the sum of squared errors minimization task, given in Eq. (3.12), as

$$\text{minimize} \quad J(\boldsymbol{\theta}) = \sum_{n=1}^{N}\left(y_n - \boldsymbol{\theta}^T\boldsymbol{x}_n\right)^2, \tag{3.39}$$

$$\text{subject to} \quad \|\boldsymbol{\theta}\|^2 \leq \rho, \tag{3.40}$$

where $\|\cdot\|$ stands for the Euclidean norm of a vector. In this way, we do not allow the LS criterion to
be completely “free” to reach a solution, but we limit the space in which we search for it. Obviously,
using different values of $\rho$, we can achieve different levels of shrinkage. As we have already discussed,
the optimal value of $\rho$ cannot be analytically obtained, and one has to experiment in order to select
an estimator that results in a good performance. For the squared error loss function and the constraint
used before, the optimization task can equivalently be written as [6,8]


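$$\text{minimize} \quad L(\boldsymbol{\theta}, \lambda) = \sum_{n=1}^{N}\left(y_n - \boldsymbol{\theta}^T\boldsymbol{x}_n\right)^2 + \lambda\|\boldsymbol{\theta}\|^2: \quad \text{ridge regression}. \tag{3.41}$$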


It turns out that for specific choices of $\lambda \geq 0$ and $\rho$, the two tasks are equivalent. Note that this new cost
function, $L(\boldsymbol{\theta}, \lambda)$, involves one term that measures the model misfit and a second one that quantifies
the size of the norm of the parameter vector. It is straightforward to see that taking the gradient of $L$ in
Eq. (3.41) with respect to $\boldsymbol{\theta}$ and equating to zero, we obtain the regularized version of the LS solution
for the linear regression task of Eq. (3.13),

$$\left(\sum_{n=1}^{N}\boldsymbol{x}_n\boldsymbol{x}_n^T + \lambda I\right)\hat{\boldsymbol{\theta}} = \sum_{n=1}^{N} y_n\boldsymbol{x}_n, \tag{3.42}$$

where $I$ is the identity matrix of appropriate dimensions. The presence of $\lambda$ biases the new solution
away from that which would have been obtained from the unregularized LS formulation. The task is
also known as ridge regression. Ridge regression attempts to reduce the norm of the estimated vector
and at the same time tries to keep the sum of squared errors small; in order to achieve this combined
goal, the vector components, $\theta_i$, are modified in such a way so that the contribution in the misfit
measuring term, from the less informative directions in the input space, is minimized. In other words,
those of the components that are associated with less informative directions will be pushed to smaller
values to keep the norm small and at the same time have minimum influence on the misfit measuring
term. We will return to this in more detail in Chapter 6. Ridge regression was first introduced in [18].
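
To make the linear system of Eq. (3.42) concrete, the following sketch forms and solves it for a tiny data set using only the Python standard library. The data, the value of $\lambda$, and the helper names (`solve_linear_system`, `ridge_fit`, `sq_norm`) are hypothetical and serve illustration only; in this version all components of the parameter vector are penalized, as in Eq. (3.41). The elimination routine assumes a nonsingular matrix and makes no attempt at numerical robustness.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting (a is assumed nonsingular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        acc = m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = acc / m[i][i]
    return z


def ridge_fit(xs, ys, lam):
    """Solve (sum_n x_n x_n^T + lam I) theta = sum_n y_n x_n, i.e., Eq. (3.42)."""
    dim = len(xs[0])
    lhs = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
            for j in range(dim)] for i in range(dim)]
    rhs = [sum(y * x[i] for x, y in zip(xs, ys)) for i in range(dim)]
    return solve_linear_system(lhs, rhs)


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(t * t for t in v)


# Hypothetical data: each regressor is [1, t], so theta = [intercept, slope].
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.1, 2.9, 5.2, 6.8]

theta_ls = ridge_fit(xs, ys, lam=0.0)      # unregularized LS solution
theta_ridge = ridge_fit(xs, ys, lam=5.0)   # regularized solution

print("LS   :", theta_ls, "squared norm:", sq_norm(theta_ls))
print("ridge:", theta_ridge, "squared norm:", sq_norm(theta_ridge))
```

For this data set the regularized solution has a visibly smaller norm than the LS one, at the price of a larger sum of squared errors.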

It has to be emphasized that in practice, the bias parameter $\theta_0$ is left out from the norm in the
regularization term; penalization of the bias would make the procedure dependent on the origin chosen
for $y$. Indeed, it is easily checked that adding a constant term to each one of the output values, $y_n$, in
the cost function would not result in just a shift of the predictions by the same constant, if the bias term
is included in the norm. Hence, usually, ridge regression is formulated as

$$\text{minimize} \quad L(\boldsymbol{\theta}, \lambda) = \sum_{n=1}^{N}\left(y_n - \theta_0 - \sum_{i=1}^{l}\theta_i x_{ni}\right)^2 + \lambda\sum_{i=1}^{l}|\theta_i|^2. \tag{3.43}$$

It turns out (Problem 3.11) that minimizing Eq. (3.43) with respect to $\theta_i$, $i = 0, 1, \ldots, l$, is equivalent
to minimizing Eq. (3.41) using centered data and neglecting the intercept. That is, one solves the task

$$\text{minimize} \quad L(\boldsymbol{\theta}, \lambda) = \sum_{n=1}^{N}\left((y_n - \bar{y}) - \sum_{i=1}^{l}\theta_i(x_{ni} - \bar{x}_i)\right)^2 + \lambda\sum_{i=1}^{l}|\theta_i|^2, \tag{3.44}$$

and the estimate of $\theta_0$ in Eq. (3.43) is given in terms of the obtained estimates $\hat{\theta}_i$, i.e.,

$$\hat{\theta}_0 = \bar{y} - \sum_{i=1}^{l}\hat{\theta}_i\bar{x}_i,$$

where

$$\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n \quad \text{and} \quad \bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_{ni}, \quad i = 1, 2, \ldots, l.$$

In other words, $\hat{\theta}_0$ compensates for the differences between the sample means of the output and input
variables. Note that similar arguments hold true if the Euclidean norm, used in Eq. (3.42) as a
regularizer, is replaced by other norms, such as $\ell_1$ or in general $\ell_p$, $p > 1$, norms (Chapter 9).
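
The origin dependence mentioned above can be observed on a toy problem with a single input variable ($l = 1$). In the sketch below, whose data, $\lambda$, and function names are illustrative assumptions, the first routine minimizes Eq. (3.44) in closed form, $\hat{\theta}_1 = \sum_n (x_n - \bar{x})(y_n - \bar{y}) / \left(\sum_n (x_n - \bar{x})^2 + \lambda\right)$, and recovers the intercept from the sample means; the second penalizes the intercept as well, that is, it solves the $2 \times 2$ system of Eq. (3.42) for regressors $[1, x_n]^T$. Both are then applied to the original outputs and to outputs shifted by a constant.

```python
def ridge_unpenalized_intercept(x, y, lam):
    """Scalar-input ridge regression via centered data, Eq. (3.44) with l = 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    theta1 = s_xy / (s_xx + lam)         # stationary point of Eq. (3.44)
    theta0 = y_bar - theta1 * x_bar      # intercept recovered from the sample means
    return theta0, theta1


def ridge_penalized_intercept(x, y, lam):
    """Same model, but theta_0 is also penalized: Eq. (3.42) for regressors [1, x_n]."""
    n = len(x)
    a11, a12, a22 = n + lam, sum(x), sum(xi * xi for xi in x) + lam
    b1, b2 = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def predict(theta, x):
    """Prediction theta_0 + theta_1 * x for a (theta_0, theta_1) pair."""
    return theta[0] + theta[1] * x


# Hypothetical data and settings.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
shift, lam, x_test = 10.0, 5.0, 1.0
y_shifted = [yi + shift for yi in y]

gap_unpenalized = (predict(ridge_unpenalized_intercept(x, y_shifted, lam), x_test)
                   - predict(ridge_unpenalized_intercept(x, y, lam), x_test))
gap_penalized = (predict(ridge_penalized_intercept(x, y_shifted, lam), x_test)
                 - predict(ridge_penalized_intercept(x, y, lam), x_test))

print("prediction shift, intercept not penalized:", gap_unpenalized)
print("prediction shift, intercept penalized    :", gap_penalized)
```

With the intercept left out of the penalty, the prediction moves by exactly the added constant; with the intercept penalized, it does not.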

From a different viewpoint, reducing the norm can be considered as an attempt to “simplify” the
structure of the estimator, because a smaller number of components of the regressor now have an
important say. This viewpoint becomes more clear if one considers nonlinear models, as discussed in
Section 3.2. In this case, the existence of the norm of the respective parameter vector in Eq. (3.41)
forces the model to get rid of the less important terms in the nonlinear expansion, $\sum_{k=1}^{K}\theta_k\phi_k(\boldsymbol{x})$, and
effectively pushes $K$ to lower values.

Although in the current context the complexity issue emerges in a rather disguised form, one can
make it a major player in the game by choosing to use different functions and norms for the regularization
term; as we will see next, there are many reasons that justify such choices.

INVERSE PROBLEMS: ILL-CONDITIONING AND OVERFITTING

Most tasks in machine learning belong to the so-called inverse problems. The latter term encompasses 
all the problems where one has to infer/predict/estimate the values of a model based on a set of avail- 
able output/input observations (training data). In a less mathematical terminology, in inverse problems 
one has to unravel unknown causes from known effects, or in other words, to reverse the cause-effect 
relations. Inverse problems are typically ill-posed, as opposed to the well-posed ones. Well-posed problems
are characterized by (a) the existence of a solution, (b) the uniqueness of the solution, and (c) the
stability of the solution. The latter condition is usually violated in machine learning problems. This 
means that the obtained solution may be very sensitive to changes of the training set. Ill-conditioning 
is another term used to describe this sensitivity. The reason for this behavior is that the model used 
to describe the data can be complex, in the sense that the number of the unknown free parameters is 
large with respect to the number of data points. The “face” with which this problem manifests itself in 
machine learning is known as overfitting. This means that during training, the estimated parameters of 
the unknown model learn too much about the idiosyncrasies of the specific training data set, and the 
model performs badly when it deals with data sets other than that used for training. As a matter of fact, 
the MSE criterion discussed in Section 3.5 attempts to quantify exactly this data dependence of the 
task, that is, the mean deviation of the obtained estimates from the true value, by changing the training 
sets. 

When the number of training samples is small with respect to the number of unknown parameters, 
the available information is not enough to “reveal” a sufficiently good model that fits the data, and it can 
be misleading due to the presence of the noise and possible outliers. Regularization is an elegant and 
efficient tool to cope with the complexity of the model, that is, to make it less complex, more smooth. 
There are different ways to achieve this. One way is by constraining the norm of the unknown vector, 
as ridge regression does. When dealing with more complex, compared to linear, models, one can use 
constraints on the smoothness of the involved nonlinear function, for example by involving derivatives 
of the model function in the regularization term. Also, regularization can help when the adopted model 
and the number of training points are such that no solution is possible. For example, in the LS linear 
regression task of Eq. (3.13), if the number $N$ of training points is less than the dimension of the
regressors $\boldsymbol{x}_n$, then the $(l + 1) \times (l + 1)$ matrix, $\Sigma := \sum_{n=1}^{N}\boldsymbol{x}_n\boldsymbol{x}_n^T$, is not invertible. Indeed, each term in
the summation is the outer product of a vector with itself and hence it is a matrix of rank one. Thus,
as we know from linear algebra, we need at least $l + 1$ linearly independent terms of such matrices to
guarantee that the sum is of full rank, and hence invertible. However, in ridge regression, this can be
bypassed, because the presence of $\lambda I$ (with $\lambda > 0$) in Eq. (3.42) guarantees that the left-hand matrix is invertible.
Furthermore, the presence of $\lambda I$ can also help when $\Sigma$ is invertible but it is ill-conditioned. Usually in
such cases, the resulting LS solution has a very large norm and, thus, it is meaningless. Regularization 
helps to replace the original ill-conditioned problem with a “nearby” one, which is well-conditioned 
and whose solution approximates the target one. 
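
The rank argument can be verified on a tiny case. The sketch below, in which the two data vectors, the value of $\lambda$, and the helper names are illustrative assumptions, builds the sum of outer products for $N = 2$ vectors in three dimensions and evaluates its determinant with and without the term $\lambda I$ of Eq. (3.42).

```python
def outer_sum(xs, lam=0.0):
    """Return sum_n x_n x_n^T + lam * I as a list of rows."""
    dim = len(xs[0])
    return [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
             for j in range(dim)] for i in range(dim)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# N = 2 training vectors in a 3-dimensional space: fewer points than dimensions.
xs = [[1.0, 2.0, 0.0], [1.0, -1.0, 3.0]]

det_plain = det3(outer_sum(xs))             # rank at most 2, so the determinant vanishes
det_ridge = det3(outer_sum(xs, lam=0.1))    # adding lam * I restores invertibility

print("determinant without regularization:", det_plain)
print("determinant with lam = 0.1        :", det_ridge)
```

Because the entries are small integers, the first determinant comes out exactly zero here; with noisy real-valued data one would typically see a very small, rather than exactly zero, value, which is the ill-conditioned situation described above.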

Another example where regularization can help to obtain a solution, or even a unique solution to
an otherwise unsolvable problem, is when the model’s order is large compared to the number of data,
although we know that it is sparse. That is, only a very small percentage of the model’s parameters
are nonzero. For such a task, a standard LS linear regression approach has no solution. However,
regularizing the sum of squared errors cost function using the $\ell_1$ norm of the parameter vector can
lead to a unique solution; the $\ell_1$ norm of a vector comprises the sum of the absolute values of its
components. This problem will be considered in Chapters 9 and 10.

Regularization is closely related to the task of using priors in Bayesian learning, as we will discuss
in Section 3.11. Finally, note that regularization is not a panacea for facing the problem of overfitting.
As a matter of fact, selecting the right set of functions $\mathcal{F}$ in Eq. (3.3) is the first crucial step. The issue
of the complexity of an estimator and the consequences on its “average” performance, as measured
over all possible data sets, is discussed in Section 3.9.

Example 3.4. The goal of this example is to demonstrate that the estimator obtained via ridge regression
can score a better MSE performance compared to the unconstrained LS solution. Let us consider,
once again, the scalar model exposed in Example 3.2, and assume that the data are generated according
to

$$y_n = \theta_o + \eta_n, \quad n = 1, 2, \ldots, N,$$

where, for simplicity, we have assumed that the regressors $x_n \equiv 1$ and $\eta_n$, $n = 1, 2, \ldots, N$, are i.i.d.
zero mean Gaussian noise samples of variance $\sigma_\eta^2$.

We have already seen in Example 3.2 that the solution to the LS parameter estimation task is the
sample mean $\hat{\theta}_{MVU} = \frac{1}{N}\sum_{n=1}^{N} y_n$. We have also shown that this solution scores an MSE of $\sigma_\eta^2/N$ and
under the Gaussian assumption for the noise it achieves the Cramer-Rao bound. The question now is
whether a biased estimator, $\hat{\theta}_b$, which corresponds to the solution of the associated ridge regression
task, can achieve an MSE lower than $\mathrm{MSE}(\hat{\theta}_{MVU})$.

It can be readily verified that Eq. (3.42), adapted to the needs of the current linear regression scenario,
results in

$$\hat{\theta}_b(\lambda) = \frac{1}{N + \lambda}\sum_{n=1}^{N} y_n = \frac{N}{N + \lambda}\,\hat{\theta}_{MVU},$$

where we have explicitly expressed the dependence of the estimate $\hat{\theta}_b$ on the regularization parameter $\lambda$.
Note that for the associated estimator we have $\mathbb{E}[\hat{\theta}_b(\lambda)] = \frac{N}{N + \lambda}\,\theta_o$.
A simple inspection of the previous relation takes us back to the discussion related to Eq. (3.24).
Indeed, by following a sequence of steps similar to those in Section 3.5.1, one can verify (see
Problem 3.12) that the minimum value of $\mathrm{MSE}(\hat{\theta}_b)$ is

$$\mathrm{MSE}(\hat{\theta}_b(\lambda_*)) = \frac{\sigma_\eta^2 / N}{1 + \dfrac{\sigma_\eta^2}{N\theta_o^2}} < \frac{\sigma_\eta^2}{N} = \mathrm{MSE}(\hat{\theta}_{MVU}), \tag{3.45}$$

attained at $\lambda_* = \sigma_\eta^2/\theta_o^2$. The answer to the question whether the ridge regression estimate offers an
improvement to the MSE performance is therefore positive in the current context. As a matter of fact,
there always exists a $\lambda > 0$ such that the ridge regression estimate, which solves the general task of
Eq. (3.41), achieves an MSE lower than the one corresponding to the MVU estimate [5, Section 8.4].

We will now demonstrate the previous theoretical findings via some simulations. To this end, the
true value of the model was chosen to be $\theta_o = 10^{-2}$. The noise was Gaussian of zero mean value and
variance $\sigma_\eta^2 = 0.1$. The number of i.i.d. generated samples was $N = 100$. Note that this is quite large,
compared to the single parameter we have to estimate. The previous values imply that $\theta_o^2 < \sigma_\eta^2/N$.

Table 3.1 Attained values of the MSE for ridge regression and different values of the regularization parameter.

$\lambda$                  $\mathrm{MSE}(\hat{\theta}_b(\lambda))$
0.1                        $9.99082 \times 10^{-4}$
1.0                        $9.79790 \times 10^{-4}$
100.0                      $2.74811 \times 10^{-4}$
$\lambda_* = 10^3$         $9.09671 \times 10^{-5}$

The attained MSE for the unconstrained LS estimate was $\mathrm{MSE}(\hat{\theta}_{MVU}) = 1.00108 \times 10^{-3}$.


Then it can be shown that for any value of $\lambda > 0$, we can obtain a value for $\mathrm{MSE}(\hat{\theta}_b(\lambda))$ which is
smaller than that of $\mathrm{MSE}(\hat{\theta}_{MVU})$ (see Problem 3.12). This is verified by the values shown in Table 3.1.
To compute the MSE values in the table, the expectation operation in the definition in Eq. (3.21) was
approximated by the respective sample mean. To this end, the experiment was repeated $L$ times and
the MSE was computed as

$$\mathrm{MSE} \approx \frac{1}{L}\sum_{i=1}^{L}\left(\hat{\theta}_i - \theta_o\right)^2,$$

where $\hat{\theta}_i$ denotes the estimate obtained in the $i$th repetition. To get accurate results, we perform $L = 10^6$ trials. The corresponding MSE value for the unconstrained
LS task is equal to $\mathrm{MSE}(\hat{\theta}_{MVU}) = 1.00108 \times 10^{-3}$. Observe that substantial improvements can be
attained when using regularization, in spite of the relatively large number of training data.

However, the percentage of performance improvement depends heavily on the specific values that
define the model, as Eq. (3.45) suggests. For example, if $\theta_o = 0.1$, the obtained values from the
experiments were $\mathrm{MSE}(\hat{\theta}_{MVU}) = 1.00061 \times 10^{-3}$ and $\mathrm{MSE}(\hat{\theta}_b(\lambda_*)) = 9.99578 \times 10^{-4}$. The theoretical
ones, as computed from Eq. (3.45), are $1 \times 10^{-3}$ and $9.99001 \times 10^{-4}$, respectively. The improvement
obtained by using the ridge regression is now rather insignificant.
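
The experiment of this example is easy to repeat on a smaller scale. The sketch below is our own minimal reconstruction of it, not the code used to produce Table 3.1: it uses $\theta_o = 10^{-2}$, $\sigma_\eta^2 = 0.1$, and $N = 100$ as in the text, but only 5000 trials and an arbitrary random seed, so the printed numbers will differ somewhat from those of the table. For comparison it also evaluates the closed-form MSE, $c^2\sigma_\eta^2/N + (1 - c)^2\theta_o^2$ with $c = N/(N + \lambda)$, which follows from the mean of $\hat{\theta}_b(\lambda)$ given above and the variance $\sigma_\eta^2/N$ of the sample mean. Function names are hypothetical.

```python
import random


def ridge_estimate(y, lam):
    """Ridge solution for the scalar model with x_n = 1: sum(y) / (N + lam)."""
    return sum(y) / (len(y) + lam)


def theoretical_mse(theta_o, noise_var, n_obs, lam):
    """Variance plus squared bias of the ridge estimator for the scalar model."""
    c = n_obs / (n_obs + lam)
    return c * c * noise_var / n_obs + (1.0 - c) ** 2 * theta_o ** 2


def monte_carlo_mse(theta_o, noise_var, n_obs, lams, n_trials, seed=0):
    """Approximate the MSE of the ridge estimate by a sample mean over n_trials data sets."""
    rng = random.Random(seed)
    noise_std = noise_var ** 0.5
    sq_err = {lam: 0.0 for lam in lams}
    for _ in range(n_trials):
        y = [theta_o + rng.gauss(0.0, noise_std) for _ in range(n_obs)]
        for lam in lams:
            sq_err[lam] += (ridge_estimate(y, lam) - theta_o) ** 2
    return {lam: total / n_trials for lam, total in sq_err.items()}


theta_o, noise_var, n_obs = 1e-2, 0.1, 100   # values used in the text
lams = [0.0, 0.1, 1.0, 100.0, 1000.0]        # lam = 0 is the unconstrained LS estimate
mse = monte_carlo_mse(theta_o, noise_var, n_obs, lams, n_trials=5000)

for lam in lams:
    print(f"lambda = {lam:7.1f}  simulated MSE = {mse[lam]:.5e}  "
          f"theoretical MSE = {theoretical_mse(theta_o, noise_var, n_obs, lam):.5e}")
```

The same data sets are reused for all values of $\lambda$, which keeps the comparison between rows free of additional sampling noise.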


3.9 THE BIAS-VARIANCE DILEMMA 


This section goes one step beyond Section 3.5. There, the MSE criterion was used to quantify the 
performance with respect to the unknown parameter. Such a setting was useful to help us understand 
some trends and also better digest the notions of “biased” versus “unbiased” estimation. Here, although 
the criterion will be the same, it will be used in a more general setting. To this end, we shift our interest 
from the unknown parameter to the dependent variable and our goal becomes obtaining an estimator of 
the value $y$, given a measurement of the regressor vector, $\mathbf{x} = \boldsymbol{x}$. Let us first consider the more general
form of regression,

$$\mathrm{y} = g(\mathbf{x}) + \eta, \tag{3.46}$$

where, for simplicity and without loss of generality, once more we have assumed that the dependent
variable takes values in the real axis, $\mathrm{y} \in \mathbb{R}$. The first question to be addressed is whether there exists
an estimator that guarantees minimum MSE performance.

3.9.1 MEAN-SQUARE ERROR ESTIMATION 

Our goal is estimating an unknown (nonlinear in general) function $g(\boldsymbol{x})$. This problem can be cast in
the context of the more general estimation task setting.

Consider the jointly distributed random variables $\mathrm{y}$, $\mathbf{x}$. Then, given any observation, $\mathbf{x} = \boldsymbol{x} \in \mathbb{R}^l$, the
task is to obtain a function $\hat{y} := \hat{g}(\boldsymbol{x}) \in \mathbb{R}$, such that

$$\hat{g}(\boldsymbol{x}) = \arg\min_{f: \mathbb{R}^l \to \mathbb{R}} \mathbb{E}\left[(\mathrm{y} - f(\boldsymbol{x}))^2\right], \tag{3.47}$$

where the expectation is taken with respect to the conditional probability of $\mathrm{y}$ given the value of $\boldsymbol{x}$, or
in other words, $p(y|\boldsymbol{x})$.

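We will show that the optimal estimator is the conditional mean value of $\mathrm{y}$ given $\boldsymbol{x}$, or

$$\hat{g}(\boldsymbol{x}) = \mathbb{E}[\mathrm{y}|\boldsymbol{x}] := \int_{-\infty}^{+\infty} y\, p(y|\boldsymbol{x})\, dy: \quad \text{optimal MSE estimator}. \tag{3.48}$$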


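Proof. We have

$$\begin{aligned}
\mathbb{E}\left[(\mathrm{y} - f(\boldsymbol{x}))^2\right] &= \mathbb{E}\left[(\mathrm{y} - \mathbb{E}[\mathrm{y}|\boldsymbol{x}] + \mathbb{E}[\mathrm{y}|\boldsymbol{x}] - f(\boldsymbol{x}))^2\right] \\
&= \mathbb{E}\left[(\mathrm{y} - \mathbb{E}[\mathrm{y}|\boldsymbol{x}])^2\right] + \mathbb{E}\left[(\mathbb{E}[\mathrm{y}|\boldsymbol{x}] - f(\boldsymbol{x}))^2\right] \\
&\quad + 2\,\mathbb{E}\left[(\mathrm{y} - \mathbb{E}[\mathrm{y}|\boldsymbol{x}])(\mathbb{E}[\mathrm{y}|\boldsymbol{x}] - f(\boldsymbol{x}))\right],
\end{aligned}$$

where the dependence of the expectation on $\boldsymbol{x}$ has been omitted for notational convenience. It is readily
seen that the last (product) term on the right-hand side is zero, hence, we are left with the following:

$$\mathbb{E}\left[(\mathrm{y} - f(\boldsymbol{x}))^2\right] = \mathbb{E}\left[(\mathrm{y} - \mathbb{E}[\mathrm{y}|\boldsymbol{x}])^2\right] + \left(\mathbb{E}[\mathrm{y}|\boldsymbol{x}] - f(\boldsymbol{x})\right)^2, \tag{3.49}$$

where we have taken into account that, for fixed $\boldsymbol{x}$, the terms $\mathbb{E}[\mathrm{y}|\boldsymbol{x}]$ and $f(\boldsymbol{x})$ are not random variables.
From Eq. (3.49) we finally obtain our claim,

$$\mathbb{E}\left[(\mathrm{y} - f(\boldsymbol{x}))^2\right] \geq \mathbb{E}\left[(\mathrm{y} - \mathbb{E}[\mathrm{y}|\boldsymbol{x}])^2\right]. \tag{3.50}$$

Note that this is a very elegant result. The optimal estimate, in the MSE sense, of the value of the
unknown function at a point $\boldsymbol{x}$ is given as $\hat{g}(\boldsymbol{x}) = \mathbb{E}[\mathrm{y}|\boldsymbol{x}]$. Sometimes, the latter is also known as the
regression of $\mathrm{y}$ conditioned on $\mathbf{x} = \boldsymbol{x}$. This is, in general, a nonlinear function. It can be shown that
if $(\mathrm{y}, \mathbf{x})$ take values in $\mathbb{R} \times \mathbb{R}^l$ and are jointly Gaussian, then the optimal MSE estimator $\mathbb{E}[\mathrm{y}|\boldsymbol{x}]$ is a
linear (affine) function of $\boldsymbol{x}$.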

The previous results generalize to the case where $\mathbf{y}$ is a random vector that takes values in $\mathbb{R}^k$. The
optimal MSE estimate, given the values of $\mathbf{x} = \boldsymbol{x}$, is equal to

$$\hat{\boldsymbol{g}}(\boldsymbol{x}) = \mathbb{E}[\mathbf{y}|\boldsymbol{x}],$$

where now $\hat{\boldsymbol{g}}(\boldsymbol{x}) \in \mathbb{R}^k$ (Problem 3.15). Moreover, if $(\mathbf{y}, \mathbf{x})$ are jointly Gaussian random vectors, the
MSE optimal estimator is also an affine function of $\boldsymbol{x}$ (Problem 3.16).

The findings of this subsection can be fully justified by physical reasoning. Assume, for simplicity,
that the noise source in Eq. (3.46) is of zero mean. Then, for a fixed value $\mathbf{x} = \boldsymbol{x}$, we have $\mathbb{E}[\mathrm{y}|\boldsymbol{x}] = g(\boldsymbol{x})$
and the respective MSE is equal to

$$\mathrm{MSE} = \mathbb{E}\left[(\mathrm{y} - \mathbb{E}[\mathrm{y}|\boldsymbol{x}])^2\right] = \sigma_\eta^2. \tag{3.51}$$

No other function of $\boldsymbol{x}$ can do better, because the optimal one achieves an MSE equal to the noise
variance, which is irreducible; it represents the intrinsic uncertainty of the system. As Eq. (3.49)
suggests, any other function $f(\boldsymbol{x})$ will result in an MSE larger by the additive term $(\mathbb{E}[\mathrm{y}|\boldsymbol{x}] - f(\boldsymbol{x}))^2$, which
corresponds to the deviation from the optimal one.
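
The decomposition of Eq. (3.49) can be observed in a short simulation. The sketch below fixes a point $x$, draws many outputs $y = g(x) + \eta$, and compares the empirical squared error of the conditional mean $g(x)$ with that of a prediction offset from it by $\delta$. The choice $g(x) = \sin(x)$, the noise level, the offset, and the function names are all illustrative assumptions of ours; the gap between the two errors should be close to $\delta^2$.

```python
import math
import random


def g(x):
    """Hypothetical regression function, so that E[y | x] = g(x) for zero-mean noise."""
    return math.sin(x)


def empirical_mse(prediction, x, noise_std, n_samples, seed=0):
    """Sample-mean approximation of E[(y - prediction)^2] for y = g(x) + eta, x fixed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        y = g(x) + rng.gauss(0.0, noise_std)
        total += (y - prediction) ** 2
    return total / n_samples


x, noise_std, delta = 1.2, 0.5, 0.3   # illustrative values

# The same seed is used twice, so both predictions are scored on identical noise samples.
mse_optimal = empirical_mse(g(x), x, noise_std, n_samples=50000)
mse_offset = empirical_mse(g(x) + delta, x, noise_std, n_samples=50000)

print(f"MSE of the conditional mean: {mse_optimal:.4f} (noise variance = {noise_std ** 2:.4f})")
print(f"MSE of the offset predictor: {mse_offset:.4f}")
print(f"difference                 : {mse_offset - mse_optimal:.4f} (delta^2 = {delta ** 2:.4f})")
```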

3.9.2 BIAS-VARIANCE TRADEOFF 

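We have just seen that the optimal estimate, in the MSE sense, of the dependent variable in a regression
task is given by the conditional expectation $\mathbb{E}[\mathrm{y}|\boldsymbol{x}]$. In practice, any estimator is computed based
on a specific training data set, say, $\mathcal{D}$. Let us make the dependence on the training set explicit and
express the estimate as a function of $\boldsymbol{x}$ parameterized on $\mathcal{D}$, or $f(\boldsymbol{x}; \mathcal{D})$. A reasonable measure to
quantify the performance of an estimator is its mean-square deviation from the optimal one, expressed
by $\mathbb{E}_{\mathcal{D}}[(f(\boldsymbol{x}; \mathcal{D}) - \mathbb{E}[\mathrm{y}|\boldsymbol{x}])^2]$, where the mean is taken with respect to all possible training sets, because
each one results in a different estimate. Following a similar path as for Eq. (3.22), we obtain

$$\mathbb{E}_{\mathcal{D}}\left[\left(f(\boldsymbol{x}; \mathcal{D}) - \mathbb{E}[\mathrm{y}|\boldsymbol{x}]\right)^2\right] = \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(f(\boldsymbol{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\boldsymbol{x}; \mathcal{D})]\right)^2\right]}_{\text{Variance}} + \underbrace{\left(\mathbb{E}_{\mathcal{D}}[f(\boldsymbol{x}; \mathcal{D})] - \mathbb{E}[\mathrm{y}|\boldsymbol{x}]\right)^2}_{\text{Bias}^2}. \tag{3.52}$$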

As was the case for the MSE parameter estimation task when changing from one training set to another,
the mean-square deviation from the optimal estimate comprises two terms. The first one is the variance
of the estimator around its own mean value and the second one is the squared difference of the mean
from the optimal estimate, that is, the bias. It turns out that one cannot make both terms small
simultaneously. For a fixed number of training points $N$ in the data sets $\mathcal{D}$, trying to minimize the variance term
results in an increase of the bias term and vice versa. This is because in order to reduce the bias term,
one has to increase the complexity (more free parameters) of the adopted estimator $f(\cdot; \mathcal{D})$. This, in
turn, results in higher variance as we change the training sets. This is a manifestation of the overfitting
issue that we have already discussed. The only way to reduce both terms simultaneously is to increase
the number of the training data points $N$, and at the same time increase the complexity of the model
carefully, so as to achieve the aforementioned goal. If one increases the number of training points and
at the same time increases the model complexity excessively, the overall MSE may increase. This is
known as the bias-variance dilemma or tradeoff. This is an issue that is omnipresent in any estimation
task. Usually, we refer to it as Occam’s razor rule.

Occam was a logician and a nominalist scholastic medieval philosopher who expressed the following
law of parsimony: “Plurality must never be posited without necessity.” The great physicist Paul
Dirac expressed the same statement from an esthetics point of view, which underlies mathematical
theories: “A theory with a mathematical beauty is more likely to be correct than an ugly one that fits the
data.” In our context of model selection, it is understood that one has to select the simplest model that
can “explain” the data. Although this is not a scientifically proven result, it underlies the rationale
behind a number of developed model selection techniques [1,32,33,40] and [37, Chapter 5], which trade
off complexity with accuracy.

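Let us now try to find the MSE, given $\boldsymbol{x}$, by considering all possible sets $\mathcal{D}$. To this end, note that
the left-hand side of Eq. (3.52) is the mean with respect to $\mathcal{D}$ of the second term in Eq. (3.49), if $\mathcal{D}$
is brought explicitly in the notation. It is easy to see that, by reconsidering Eq. (3.49) and taking the
expectation on both $y$ and $\mathcal{D}$, given the value of $\mathbf{x} = \boldsymbol{x}$, the resulting MSE becomes (try it, following
similar arguments as for Eq. (3.52))

$$
\begin{aligned}
\mathrm{MSE}(\boldsymbol{x}) &= \mathbb{E}_{y|\boldsymbol{x}}\,\mathbb{E}_{\mathcal{D}}\left[\left(y - f(\boldsymbol{x};\mathcal{D})\right)^2\right] \\
&= \sigma_\eta^2 + \mathbb{E}_{\mathcal{D}}\left[\left(f(\boldsymbol{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\boldsymbol{x};\mathcal{D})]\right)^2\right] + \left(\mathbb{E}_{\mathcal{D}}[f(\boldsymbol{x};\mathcal{D})] - \mathbb{E}[y|\boldsymbol{x}]\right)^2,
\end{aligned}
\tag{3.53}
$$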


where Eq. (3.51) has been used and the product rule, as stated in Chapter 2, has been exploited. In
the sequel, one can take the mean over $\boldsymbol{x}$. In other words, this is the prediction MSE over all possible
inputs, averaged over all possible training sets. The resulting MSE is also known as the test or
generalization error and it is a measure of the performance of the respective adopted model. Note that the
generalization error in Eq. (3.53) involves averaging over (theoretically) all possible training data sets
of certain size $N$. In contrast, the so-called training error is computed over a single data set, the one
used for the training, and this results in an overoptimistic estimate of the error. We will come back to
this important issue in Section 3.13.

Example 3.5. Let us consider a simplistic, yet pedagogic, example that demonstrates this tradeoff 
between bias and variance. We are given a set of training points that are generated according to a 
regression model of the form 


$$
y = g(x) + \eta. \tag{3.54}
$$

The graph of $g(x)$ is shown in Fig. 3.7. The function $g(x)$ is a fifth-order polynomial. Training sets are
generated as follows. For each set, the x-axis is sampled at $N$ equidistant points $x_n$, $n = 1, \ldots, N$, in
the interval $[-1, 1]$. Then, each training set $\mathcal{D}_i$ is created as

$$
\mathcal{D}_i = \left\{(g(x_n) + \eta_{n,i},\ x_n) : n = 1, 2, \ldots, N\right\}, \quad i = 1, 2, \ldots,
$$

where $\eta_{n,i}$ denotes different noise samples, drawn i.i.d. from a white noise process. In other words, all
training points have the same x-coordinate but different y-coordinates due to the different values of the
noise. The gray dots in the $(x, y)$-plane in Fig. 3.7 correspond to one realization of a training set, for
the case of $N = 10$. For comparison, the set of noiseless points, $(g(x_n), x_n)$, $n = 1, 2, \ldots, 10$, is shown
as red dots on the graph of $g(x)$.

First, we are going to be very naive, and choose a fixed linear model to fit the data,

$$
\hat{y} = f_1(x) = \theta_0 + \theta_1 x,
$$

where the values $\theta_1$ and $\theta_0$ have been chosen arbitrarily, irrespective of the training data. The graph of
this straight line is shown in Fig. 3.7. Because no training was involved and the model parameters are
fixed, there is no variation as one changes the training sets and $\mathbb{E}_{\mathcal{D}}[f_1(x)] = f_1(x)$, with the variance
term being equal to zero. On the other hand, the square of the bias, which is equal to $(f_1(x) - \mathbb{E}[y|x])^2$,
is expected to be large because the choice of the model was arbitrary, without paying attention to the
training data.

In the sequel, we go to the other “extreme.” We choose a relatively complex class of functions, such
as a high-order (10th-order) polynomial $f_2$. The dependence of the resulting estimates on the respective
set $\mathcal{D}$, which is used for training, is explicitly denoted as $f_2(\cdot;\mathcal{D})$. Note that for each one of the sets,
the corresponding graph of the resulting optimal model is expected to go through the training points,
i.e.,

$$
f_2(x_n;\mathcal{D}_i) = g(x_n) + \eta_{n,i}, \quad n = 1, 2, \ldots, N.
$$

This is shown in Fig. 3.7 for one curve. Note that, in general, this is always the case if the order of the
polynomial (model) is large and the number of parameters to be estimated is larger than the number of
training points. For the example shown in the figure, we are given 10 training points and 11 parameters
have to be estimated, for a 10th-order polynomial (including the bias). One can obtain a perfect fit on
the training points by solving a linear system comprising 10 equations ($N = 10$) with 11 unknowns.
This is the reason that when the order of the model is large, a zero (or very small, in practice) error can
be achieved on the training data set. We will come back to this issue in Section 3.13.

FIGURE 3.7

The observed data are denoted as gray dots. These are the result of adding noise to the red points, which lie on the
red curve associated with the unknown $g(\cdot)$. Fitting the data by the fixed polynomial $f_1(x)$ results in high bias;
observe that most of the data points lie outside the straight line. On the other hand, the variance of the “estimator”
will be zero. In contrast, fitting a high-degree polynomial $f_2(x;\mathcal{D})$ results in low bias, because the corresponding
curve goes through all the data points; however, the respective variance will be high.

For this high-order polynomial fitting setup of the experiment, the following reasoning holds true.
The bias term at each point $x_n$, $n = 1, 2, \ldots, N$, is zero, because

$$
\mathbb{E}_{\mathcal{D}}[f_2(x_n;\mathcal{D})] = \mathbb{E}_{\mathcal{D}}[g(x_n) + \eta] = g(x_n) = \mathbb{E}_{\mathcal{D}}[y|x_n].
$$

On the other hand, the variance term at the points $x_n$, $n = 1, 2, \ldots, N$, is expected to be large, because

$$
\mathbb{E}_{\mathcal{D}}\left[\left(f_2(x_n;\mathcal{D}) - g(x_n)\right)^2\right] = \mathbb{E}_{\mathcal{D}}\left[\left(g(x_n) + \eta - g(x_n)\right)^2\right] = \sigma_\eta^2.
$$

Assuming that the functions $f_2$ and $g$ are continuous and smooth enough and the points $x_n$ are sampled
densely enough to cover the interval of interest in the real axis, we expect similar behavior at all points
$x \neq x_n$.

Example 3.6. This is a more realistic example that builds upon the previous one. The data are generated
as before, via the regression model in Eq. (3.54) using the fifth-order polynomial for $g$. The number of
training points is equal to $N = 10$ and 1000 training sets $\mathcal{D}_i$, $i = 1, 2, \ldots, 1000$, are generated. Two sets
of experiments have been run.

The first one attempts to fit in the noisy data a high-order polynomial of degree equal to 10 (as
in the previous example) and the second one adopts a second-order polynomial. For each one of the
two setups, the experiment is repeated 1000 times, each time with a different data set $\mathcal{D}_i$. Figs. 3.8A
and C show 10 (for visibility reasons) out of the 1000 resulting curves for the high- and low-order
polynomials, respectively. The substantially higher variance for the case of the high-order polynomial
is readily noticed. Figs. 3.8B and D show the corresponding curves which result from averaging over
the 1000 performed experiments, together with the graph of the “unknown” (original $g$) function. The
high-order polynomial results in an excellent fit of very low bias. The opposite is true for the case of
the second-order polynomial.

Thus, in summary, for a fixed number of training points, the more complex the prediction model 
is (larger number of parameters), the larger the variance becomes, as we change from one training set 
to another. On the other hand, the more complex the model is, the smaller the bias gets; that is, the 
average model by training over different data sets gets closer to the optimal MSE one. The reader may 
find more information on the bias-variance dilemma task in [16]. 
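
The experiment of Example 3.6 can be imitated with a short simulation. The sketch below is only illustrative and rests on several assumptions that are not taken from the text: the specific fifth-order polynomial `g`, the noise standard deviation `sigma`, the number of training sets (500), and the use of a 7th-order polynomial as the "complex" model (chosen instead of order 10 so that, with N = 10 points, the least-squares normal equations stay well posed). Bias and variance are estimated only at the training inputs.

```python
import random

def g(x):
    # Hypothetical fifth-order polynomial standing in for the unknown g(x).
    return x**5 - x**3 + 0.5 * x

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def polyfit_ls(xs, ys, degree):
    # Least-squares polynomial fit via the normal equations.
    Phi = [[x**k for k in range(degree + 1)] for x in xs]
    m = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(m)]
    return solve(A, b)

def predict(theta, x):
    return sum(t * x**k for k, t in enumerate(theta))

def bias_variance(degree, n_points=10, n_sets=500, sigma=0.1, seed=0):
    # Returns (squared bias, variance), both averaged over the training inputs.
    rng = random.Random(seed)
    xs = [-1 + 2 * n / (n_points - 1) for n in range(n_points)]
    preds = []
    for _ in range(n_sets):
        ys = [g(x) + rng.gauss(0.0, sigma) for x in xs]  # same x_n, fresh noise
        theta = polyfit_ls(xs, ys, degree)
        preds.append([predict(theta, x) for x in xs])
    mean = [sum(p[n] for p in preds) / n_sets for n in range(n_points)]
    var = sum(sum((p[n] - mean[n]) ** 2 for p in preds) / n_sets
              for n in range(n_points)) / n_points
    bias2 = sum((mean[n] - g(xs[n])) ** 2 for n in range(n_points)) / n_points
    return bias2, var

if __name__ == "__main__":
    for degree in (2, 7):
        b2, v = bias_variance(degree)
        print(f"degree {degree}: bias^2 = {b2:.5f}, variance = {v:.5f}")
```

Under these assumptions, the low-order fit should show the larger squared bias and the higher-order fit the larger variance, in line with the summary above.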


3.10 MAXIMUM LIKELIHOOD METHOD

So far, we have approached the estimation problem as an optimization task around a set of training 
examples, without paying any attention to the underlying statistics that generates these points. We only 
used statistics in order to check under which conditions the estimators were efficient. However, the 
optimization step did not involve any statistical information. For the rest of the chapter, we are going to
involve statistics more and more. In this section, the ML method is introduced. It is not an exaggeration
to say that ML and LS are two of the major pillars on which parameter estimation is based and new
methods are inspired from. The ML method was suggested by Sir Ronald Aylmer Fisher.

FIGURE 3.8

(A) Ten of the resulting curves from fitting a 10th-order polynomial. (B) The corresponding average over 1000
different experiments. The red curve represents the unknown polynomial. The dots indicate the points that give birth
to the training data, as described in the text. (C), (D) The results from fitting a second-order polynomial. Observe
the bias-variance tradeoff as a function of the complexity of the fitted model.

Once more, we will first formulate the method in a general setting, independent of the
regression/classification tasks. We are given a set of, say, $N$ observations, $\mathcal{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$, drawn from
a probability distribution. We assume that the joint PDF of these $N$ observations is of a known
parametric functional type, denoted as $p(\mathcal{X};\boldsymbol{\theta})$, where the parameter vector $\boldsymbol{\theta} \in \mathcal{A} \subset \mathbb{R}^K$ is unknown and the
task is to estimate its value. This joint PDF is known as the likelihood function of $\boldsymbol{\theta}$ with respect to the
given set of observations $\mathcal{X}$. According to the ML method, the estimate is provided by

$$
\hat{\boldsymbol{\theta}}_{\mathrm{ML}} := \arg\max_{\boldsymbol{\theta} \in \mathcal{A} \subset \mathbb{R}^K} p(\mathcal{X};\boldsymbol{\theta}): \quad \text{maximum likelihood estimate.} \tag{3.55}
$$

FIGURE 3.9

According to the maximum likelihood method, we assume that, given the set of observations, the estimate of the
unknown parameter is the value that maximizes the corresponding likelihood function.


For simplicity, we will assume that the constraint set, $\mathcal{A}$, coincides with $\mathbb{R}^K$, i.e., $\mathcal{A} = \mathbb{R}^K$, and that the
parameterized family $\{p(\mathcal{X};\boldsymbol{\theta}): \boldsymbol{\theta} \in \mathbb{R}^K\}$ enjoys a unique maximizer with respect to the parameter $\boldsymbol{\theta}$.
This is illustrated in Fig. 3.9. In other words, given the set of observations $\mathcal{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$, one
selects the unknown parameter vector so as to make this joint event the most likely one to happen.

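Because the logarithmic function $\ln(\cdot)$ is monotone and increasing, one can instead search for the
maximum of the log-likelihood function by setting its gradient to zero:

$$
\left.\frac{\partial \ln p(\mathcal{X};\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{\mathrm{ML}}} = \boldsymbol{0}. \tag{3.56}
$$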


Assuming the observations to be i.i.d., the ML estimator has some very attractive properties, namely:

• The ML estimator is asymptotically unbiased; that is, assuming that the model of the PDF which
we have adopted is correct and there exists a true parameter $\boldsymbol{\theta}_o$, we have

$$
\lim_{N \to \infty} \mathbb{E}[\hat{\boldsymbol{\theta}}_{\mathrm{ML}}] = \boldsymbol{\theta}_o. \tag{3.57}
$$

• The ML estimate is asymptotically consistent so that given any value of $\epsilon > 0$,

$$
\lim_{N \to \infty} \mathrm{Prob}\left\{\left\|\hat{\boldsymbol{\theta}}_{\mathrm{ML}} - \boldsymbol{\theta}_o\right\| > \epsilon\right\} = 0, \tag{3.58}
$$

that is, for large values of $N$, we expect the ML estimate to be very close to the true value with high
probability.

• The ML estimator is asymptotically efficient; that is, it achieves the Cramér-Rao lower bound.

• If there exists a sufficient statistic $T(\mathcal{X})$ for an unknown parameter, then only $T(\mathcal{X})$ suffices to
express the respective ML estimate (Problem 3.20).

• Moreover, assuming that an efficient estimator does exist, this estimator is optimal in the ML sense
(Problem 3.21).

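Example 3.7. Let $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N$ be the observation vectors stemming from a multivariate normal
distribution with known covariance matrix and unknown mean (Chapter 2), that is,

$$
p(\boldsymbol{x}_n;\boldsymbol{\mu}) = \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{x}_n - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})\right).
$$

Assume that the observations are mutually independent. Obtain the ML estimate of the unknown mean
vector.

For the $N$ statistically independent observations, the joint log-likelihood function is given by

$$
L(\boldsymbol{\mu}) = \ln \prod_{n=1}^{N} p(\boldsymbol{x}_n;\boldsymbol{\mu}) = -\frac{N}{2}\ln\left((2\pi)^l |\Sigma|\right) - \frac{1}{2}\sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}).
$$

Taking the gradient with respect to $\boldsymbol{\mu}$, we obtain

$$
\frac{\partial L(\boldsymbol{\mu})}{\partial \boldsymbol{\mu}} := \left[\frac{\partial L}{\partial \mu_1}, \frac{\partial L}{\partial \mu_2}, \ldots, \frac{\partial L}{\partial \mu_l}\right]^T = \sum_{n=1}^{N} \Sigma^{-1}(\boldsymbol{x}_n - \boldsymbol{\mu}),
$$

and equating to $\boldsymbol{0}$ leads to

$$
\hat{\boldsymbol{\mu}}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{x}_n.
$$

In other words, for Gaussian distributed data, the ML estimate of the mean is the sample mean.
Moreover, note that the ML estimate is expressed in terms of its sufficient statistic (see Section 3.7).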
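
As a quick numerical illustration of this result, the sketch below treats the scalar case ($l = 1$) with a known variance. The data (true mean 3, variance 4) and the function names are hypothetical choices made for the illustration only. It checks that the joint log-likelihood evaluated at the sample mean is not exceeded at nearby values of the mean.

```python
import math
import random

def log_likelihood(mu, xs, sigma2):
    # Joint log-likelihood of i.i.d. scalar Gaussian observations, known variance sigma2.
    n = len(xs)
    quad = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - quad / (2 * sigma2)

def ml_mean(xs):
    # Root of the gradient sum_n (x_n - mu) / sigma2 = 0, i.e., the sample mean.
    return sum(xs) / len(xs)

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(3.0, 2.0) for _ in range(200)]  # hypothetical data
    mu_hat = ml_mean(xs)
    best = log_likelihood(mu_hat, xs, 4.0)
    for delta in (-0.5, -0.01, 0.01, 0.5):
        # Moving away from the sample mean can only lower the log-likelihood.
        print(delta, log_likelihood(mu_hat + delta, xs, 4.0) < best)
```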


3.10.1 LINEAR REGRESSION: THE NONWHITE GAUSSIAN NOISE CASE 

Consider the linear regression model 

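$$
y = \boldsymbol{\theta}^T \boldsymbol{x} + \eta.
$$

We are given $N$ training data points $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$. The corresponding (unobserved) noise
samples $\eta_n$, $n = 1, \ldots, N$, are assumed to follow a jointly Gaussian distribution with zero mean and
covariance matrix equal to $\Sigma_\eta$. That is, the corresponding random vector of all the noise samples
stacked together, $\boldsymbol{\eta} = [\eta_1, \ldots, \eta_N]^T$, follows the multivariate Gaussian distribution

$$
p(\boldsymbol{\eta}) = \frac{1}{(2\pi)^{N/2}|\Sigma_\eta|^{1/2}} \exp\left(-\frac{1}{2}\boldsymbol{\eta}^T \Sigma_\eta^{-1} \boldsymbol{\eta}\right).
$$

Our goal is to obtain the ML estimate of the parameters, $\boldsymbol{\theta}$.

Footnote 7: Recall from matrix algebra that $\frac{\partial(\boldsymbol{x}^T\boldsymbol{b})}{\partial \boldsymbol{x}} = \boldsymbol{b}$ and $\frac{\partial(\boldsymbol{x}^T A \boldsymbol{x})}{\partial \boldsymbol{x}} = 2A\boldsymbol{x}$ if $A$ is symmetric (Appendix A).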

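Replacing $\boldsymbol{\eta}$ with $\boldsymbol{y} - X\boldsymbol{\theta}$, and taking the logarithm, the joint log-likelihood function of $\boldsymbol{\theta}$ with
respect to the training set is given by

$$
L(\boldsymbol{\theta}) = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_\eta| - \frac{1}{2}(\boldsymbol{y} - X\boldsymbol{\theta})^T \Sigma_\eta^{-1} (\boldsymbol{y} - X\boldsymbol{\theta}), \tag{3.59}
$$

where $\boldsymbol{y} := [y_1, y_2, \ldots, y_N]^T$, and $X := [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N]^T$ stands for the input matrix. Taking the gradient
with respect to $\boldsymbol{\theta}$, we get

$$
\frac{\partial L(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = X^T \Sigma_\eta^{-1} (\boldsymbol{y} - X\boldsymbol{\theta}), \tag{3.60}
$$

and equating to the zero vector, we obtain

$$
\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \left(X^T \Sigma_\eta^{-1} X\right)^{-1} X^T \Sigma_\eta^{-1} \boldsymbol{y}. \tag{3.61}
$$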


Remarks 3.3. 

• Compare Eq. (3.61) with the LS solution given in Eq. (3.17). They are different, unless the covariance
matrix of the successive noise samples, $\Sigma_\eta$, is diagonal and of the form $\sigma_\eta^2 I$, that is, if the noise
is Gaussian as well as white. In this case, the LS and ML solutions coincide. However, if the noise
sequence is nonwhite, the two estimates differ. Moreover, it can be shown (Problem 3.9) that in this
case of colored Gaussian noise, the ML estimate is an efficient one and it attains the Cramér-Rao
bound, even if $N$ is finite.
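
The difference between the two estimates can be seen in the simplest possible setting. The sketch below specializes Eq. (3.61) to a single parameter ($K = 1$) and to a diagonal noise covariance with unequal entries, which is an assumption made here to keep the algebra scalar; it is not the general colored-noise case. The function names and the data are hypothetical.

```python
def ml_scalar(xs, ys, noise_vars):
    # Eq. (3.61) with K = 1 and a diagonal covariance diag(noise_vars):
    # theta = (x^T S^{-1} y) / (x^T S^{-1} x).
    num = sum(x * y / v for x, y, v in zip(xs, ys, noise_vars))
    den = sum(x * x / v for x, v in zip(xs, noise_vars))
    return num / den

def ls_scalar(xs, ys):
    # Ordinary least squares for the same one-parameter model.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 1.9, 3.2, 8.0]            # the last sample is badly corrupted
    noise_vars = [0.1, 0.1, 0.1, 10.0]   # and is known to be very noisy
    print("ML:", ml_scalar(xs, ys, noise_vars))
    print("LS:", ls_scalar(xs, ys))
    # With equal variances the two estimates coincide.
    print("ML, white noise:", ml_scalar(xs, ys, [0.5] * 4))
```

The ML estimate downweights the sample with the large noise variance, whereas LS treats all samples alike; when all variances are equal, the weights cancel and the two estimates agree.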


3.11 BAYESIAN INFERENCE 

In our discussion so far, we have assumed that the parameters associated with the functional form of 
the adopted model are deterministic constants whose values are unknown to us. In this section, we 
will follow a different rationale. The unknown parameters will be treated as random variables. Hence, 
whenever our goal is to estimate their values, this is conceived as an effort to estimate the values of a 
specific realization that corresponds to the observed data. A more detailed discussion concerning the 
Bayesian inference rationale is provided in Chapter 12. As the name Bayesian suggests, the heart of 
the method beats around the celebrated Bayes theorem. Given two jointly distributed random vectors, 
say, $\mathbf{x}$, $\boldsymbol{\theta}$, the Bayes theorem states that

$$
p(\boldsymbol{x}, \boldsymbol{\theta}) = p(\boldsymbol{x}|\boldsymbol{\theta})p(\boldsymbol{\theta}) = p(\boldsymbol{\theta}|\boldsymbol{x})p(\boldsymbol{x}). \tag{3.62}
$$

Thomas Bayes (1702-1761) was an English mathematician and a Presbyterian minister who first
developed the basics of the theory. However, it was Pierre-Simon Laplace (1749-1827), the great French
mathematician, who further developed and popularized it.

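Assume that $\mathbf{x}$, $\boldsymbol{\theta}$ are two statistically dependent random vectors. Let $\mathcal{X} = \{\boldsymbol{x}_n \in \mathbb{R}^l, n = 1, 2, \ldots, N\}$
be the set of observations resulting from $N$ successive experiments. Then the Bayes theorem gives

$$
p(\boldsymbol{\theta}|\mathcal{X}) = \frac{p(\mathcal{X}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(\mathcal{X})} = \frac{p(\mathcal{X}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{\int p(\mathcal{X}|\boldsymbol{\theta})p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}. \tag{3.63}
$$

Obviously, if the observations are i.i.d., then we can write

$$
p(\mathcal{X}|\boldsymbol{\theta}) = \prod_{n=1}^{N} p(\boldsymbol{x}_n|\boldsymbol{\theta}).
$$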

In the previous formulas, $p(\boldsymbol{\theta})$ is the a priori or prior PDF concerning the statistical distribution of
$\boldsymbol{\theta}$, and $p(\boldsymbol{\theta}|\mathcal{X})$ is the conditional or a posteriori or posterior PDF, formed after the set of $N$
observations has been obtained. The prior probability density, $p(\boldsymbol{\theta})$, can be considered as a constraint that
encapsulates our prior knowledge about $\boldsymbol{\theta}$. Undoubtedly, our uncertainty about $\boldsymbol{\theta}$ is modified after
the observations have been received, because more information is now disclosed to us. If the adopted
assumptions about the underlying models are sensible, we expect the posterior PDF to be a more
accurate one to describe the statistical nature of $\boldsymbol{\theta}$. We will refer to the process of approximating the
PDF of a random quantity based on a set of training data as inference, to differentiate it from the
process of estimation, which returns a single value for each parameter/variable. So, according to the
inference approach, one attempts to draw conclusions about the nature of the randomness that
underlies the variables of interest. This information can in turn be used to make predictions and to take
decisions.

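We will exploit Eq. (3.63) in two ways. The first refers to our familiar goal of obtaining an estimate
of the parameter vector $\boldsymbol{\theta}$, which “controls” the model that describes the generation mechanism of
our observations, $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N$. Because $\mathbf{x}$ and $\boldsymbol{\theta}$ are two statistically dependent random vectors, we
know from Section 3.9 that the MSE optimal estimate of the value of $\boldsymbol{\theta}$, given $\mathcal{X}$, is

$$
\hat{\boldsymbol{\theta}} = \mathbb{E}[\boldsymbol{\theta}|\mathcal{X}] = \int \boldsymbol{\theta}\, p(\boldsymbol{\theta}|\mathcal{X})\,d\boldsymbol{\theta}. \tag{3.64}
$$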

Another direction along which one can exploit the Bayes theorem, in the context of statistical
inference, is to obtain an estimate of the PDF of $\mathbf{x}$ given the observations $\mathcal{X}$. This can be done by
marginalizing over a distribution, using the equation

$$
p(\boldsymbol{x}|\mathcal{X}) = \int p(\boldsymbol{x}|\boldsymbol{\theta})p(\boldsymbol{\theta}|\mathcal{X})\,d\boldsymbol{\theta}, \tag{3.65}
$$

where the conditional independence of $\mathbf{x}$ on $\mathcal{X}$, given the value of $\boldsymbol{\theta}$, expressed as $p(\boldsymbol{x}|\mathcal{X}, \boldsymbol{\theta}) =
p(\boldsymbol{x}|\boldsymbol{\theta})$, has been used. Indeed, if the value of $\boldsymbol{\theta}$ is given, then the conditional $p(\boldsymbol{x}|\boldsymbol{\theta})$ is fully defined
and does not depend on $\mathcal{X}$. The dependence of $\mathbf{x}$ on $\mathcal{X}$ is through $\boldsymbol{\theta}$, if the latter is unknown. Eq. (3.65)
provides an estimate of the unknown PDF, by exploiting the information that resides in the obtained
observations and in the adopted functional dependence on the parameters $\boldsymbol{\theta}$. Note that, in contrast to
what we did in the case of the ML method, where we used the observations to obtain an estimate of the
parameter vector, here we assume the parameters to be random variables, provide our prior knowledge
about $\boldsymbol{\theta}$ via $p(\boldsymbol{\theta})$, and integrate the joint PDF, $p(\boldsymbol{x}, \boldsymbol{\theta}|\mathcal{X})$, over $\boldsymbol{\theta}$.

Once $p(\boldsymbol{x}|\mathcal{X})$ is available, it can be used for prediction. Assuming that we have obtained the
observations $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$, our estimate about the next value, $\boldsymbol{x}_{N+1}$, can be determined via $p(\boldsymbol{x}_{N+1}|\mathcal{X})$.
Obviously, the form of $p(\boldsymbol{x}|\mathcal{X})$ is, in general, changing as new observations are obtained, because each
time an observation becomes available, part of our uncertainty about the underlying randomness is
removed.

Example 3.8. Consider the simplified linear regression task of Eq. (3.33) and assume $x = 1$. As we
have already said, this problem is that of estimating the value of a constant buried in noise. Our
methodology will follow the Bayesian philosophy. Assume that the noise samples are i.i.d. drawn from a
Gaussian PDF of zero mean and variance $\sigma_\eta^2$. However, we impose our a priori knowledge concerning
the unknown $\theta$ via the prior distribution

$$
p(\theta) = \mathcal{N}(\theta_0, \sigma_0^2). \tag{3.66}
$$

That is, we assume that we know that the values of $\theta$ lie around $\theta_0$, and $\sigma_0^2$ quantifies our degree of
uncertainty about this prior knowledge. Our goals are first to obtain the posterior PDF, given the set
of observations $\boldsymbol{y} = [y_1, \ldots, y_N]^T$, and then to obtain $\mathbb{E}[\theta|\boldsymbol{y}]$, according to Eqs. (3.63) and (3.64) after
adapting them to our current notational needs. We have


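$$
p(\theta|\boldsymbol{y}) = \frac{p(\boldsymbol{y}|\theta)p(\theta)}{p(\boldsymbol{y})}
= \frac{1}{p(\boldsymbol{y})}\left(\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma_\eta}\exp\left(-\frac{(y_n - \theta)^2}{2\sigma_\eta^2}\right)\right) \times \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left(-\frac{(\theta - \theta_0)^2}{2\sigma_0^2}\right). \tag{3.67}
$$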


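After some algebraic manipulations of Eq. (3.67) (Problem 3.25), one ends up with the following:

$$
p(\theta|\boldsymbol{y}) = \frac{1}{\sqrt{2\pi}\,\sigma_N}\exp\left(-\frac{(\theta - \bar{\theta}_N)^2}{2\sigma_N^2}\right), \tag{3.68}
$$

where

$$
\bar{\theta}_N = \frac{N\sigma_0^2\,\bar{y}_N + \sigma_\eta^2\,\theta_0}{N\sigma_0^2 + \sigma_\eta^2}, \tag{3.69}
$$

with $\bar{y}_N = \frac{1}{N}\sum_{n=1}^{N} y_n$ being the sample mean of the observations and

$$
\sigma_N^2 = \frac{\sigma_\eta^2\,\sigma_0^2}{N\sigma_0^2 + \sigma_\eta^2}. \tag{3.70}
$$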


In other words, if the prior and conditional PDFs are Gaussians, then the posterior is also Gaussian.
Moreover, the mean and the variance of the posterior are given by Eqs. (3.69) and (3.70), respectively.

Observe that as the number of observations increases, $\bar{\theta}_N$ tends to the sample mean of the
observations; recall that the latter is the estimate that results from the ML method. Also, note that the variance
keeps decreasing as the number of observations increases, which is in line with common sense, because
more observations mean less uncertainty. Fig. 3.10 illustrates the previous discussion. Data samples,
$y_n$, were generated using a Gaussian pseudorandom generator with the mean equal to $\theta = 1$ and
variance equal to $\sigma_\eta^2 = 0.1$. So the true value of our constant is equal to 1. We used a Gaussian prior PDF
with mean value equal to $\theta_0 = 2$ and variance $\sigma_0^2 = 6$. We observe that as $N$ increases, the posterior
PDF gets narrower and its mean tends to the true value of 1.



FIGURE 3.10 

In the Bayesian inference approach, note that as the number of observations increases, our uncertainty about the 
true value of the unknown parameter is reduced and the mean of the posterior PDF tends to the true value and the 
variance tends to zero. 

It should be pointed out that in this example case, both the ML and LS estimates become identical,
or

$$
\hat{\theta} = \frac{1}{N}\sum_{n=1}^{N} y_n = \bar{y}_N.
$$

This will also be the case for the mean value in Eq. (3.69) if we set $\sigma_0^2$ very large, as might happen if
we have no confidence in our initial estimate of $\theta_0$ and we assign a very large value to $\sigma_0^2$. In effect,
this is equivalent to using no prior information.
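
The behavior described in this example can be reproduced with a few lines of code. The sketch below evaluates Eqs. (3.69) and (3.70) for growing $N$, using the values quoted in the text ($\theta = 1$, $\sigma_\eta^2 = 0.1$, $\theta_0 = 2$, $\sigma_0^2 = 6$); the function name, the random seed, and the chosen values of $N$ are assumptions made for the illustration.

```python
import random

def gaussian_posterior(ys, theta0, sigma0_sq, sigma_eta_sq):
    # Posterior mean and variance for a constant in Gaussian noise with a Gaussian prior.
    n = len(ys)
    y_bar = sum(ys) / n
    denom = n * sigma0_sq + sigma_eta_sq
    mean = (n * sigma0_sq * y_bar + sigma_eta_sq * theta0) / denom  # Eq. (3.69)
    var = sigma_eta_sq * sigma0_sq / denom                           # Eq. (3.70)
    return mean, var

if __name__ == "__main__":
    rng = random.Random(0)
    theta_true, sigma_eta_sq = 1.0, 0.1
    ys = [rng.gauss(theta_true, sigma_eta_sq ** 0.5) for _ in range(1000)]
    for n in (1, 10, 100, 1000):
        mean, var = gaussian_posterior(ys[:n], theta0=2.0, sigma0_sq=6.0,
                                       sigma_eta_sq=sigma_eta_sq)
        print(f"N = {n:4d}: posterior mean = {mean:.4f}, variance = {var:.6f}")
    # A very large prior variance makes the posterior mean approach the sample mean.
    mean_flat, _ = gaussian_posterior(ys[:10], 2.0, 1e12, sigma_eta_sq)
    print("flat prior:", mean_flat, "sample mean:", sum(ys[:10]) / 10)
```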
Let us now investigate what happens if our prior knowledge about $\theta_0$ is “embedded” in the LS
criterion in the form of a constraint. This can be done by modifying the constraint in Eq. (3.40) such
that

$$
(\theta - \theta_0)^2 \leq \rho, \tag{3.71}
$$

which leads to the equivalent minimization of the following Lagrangian:

$$
\text{minimize} \quad L(\theta, \lambda) = \sum_{n=1}^{N}(y_n - \theta)^2 + \lambda\left((\theta - \theta_0)^2 - \rho\right). \tag{3.72}
$$

Taking the derivative with respect to $\theta$ and equating to zero, we obtain

$$
\hat{\theta} = \frac{N\bar{y}_N + \lambda\theta_0}{N + \lambda},
$$

which, for $\lambda = \sigma_\eta^2/\sigma_0^2$, becomes identical to Eq. (3.69). The world is small after all! This has happened
only because we used Gaussians both for the conditional and for the prior PDFs. For different forms of
PDFs, this would not be the case. However, this example shows that a close relationship ties priors and
constraints. They both attempt to impose prior information. Each method, in its own unique way, is
associated with the respective pros and cons. In Chapters 12 and 13, where a more extended treatment
of the Bayesian inference task is provided, we will see that the very essence of regularization, which is
a means against overfitting, lies at the heart of the Bayesian approach.
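
For completeness, the stationarity step for the Lagrangian in Eq. (3.72) and the substitution that recovers Eq. (3.69) can be written out as follows. This is a sketch of the algebra only; $\bar{y}_N$ denotes the sample mean of the observations.

```latex
\begin{align*}
\frac{\partial L(\theta,\lambda)}{\partial \theta}
  &= -2\sum_{n=1}^{N}(y_n - \theta) + 2\lambda(\theta - \theta_0) = 0
  \;\Longrightarrow\;
  (N + \lambda)\,\theta = N\bar{y}_N + \lambda\theta_0 , \\
\hat{\theta}\,\Big|_{\lambda = \sigma_\eta^2/\sigma_0^2}
  &= \frac{N\bar{y}_N + \dfrac{\sigma_\eta^2}{\sigma_0^2}\,\theta_0}{N + \dfrac{\sigma_\eta^2}{\sigma_0^2}}
   = \frac{N\sigma_0^2\,\bar{y}_N + \sigma_\eta^2\,\theta_0}{N\sigma_0^2 + \sigma_\eta^2} .
\end{align*}
```

The last equality follows by multiplying the numerator and the denominator by $\sigma_0^2$.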

Finally, one may wonder if the Bayesian inference has offered us any more information, compared 
to the deterministic parameter estimation path. After all, when the aim is to obtain a specific value for 
the unknown parameter, taking the mean of the Gaussian posterior comes to the same solution which 
results from the regularized LS approach. Well, even for this simple case, the Bayesian inference readily 
provides a piece of extra information; this is an estimate of the variance around the mean, which is very 
valuable in order to assess our trust of the recovered estimate. Of course, all these are valid provided 
that the adopted PDFs offer a good description of the statistical nature of the process at hand [24]. 

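Finally, it can be shown (Problem 3.26) that the previously obtained results can be generalized for
the more general linear regression model, of nonwhite Gaussian noise, considered in Section 3.10,
which is modeled as

$$
\boldsymbol{y} = X\boldsymbol{\theta} + \boldsymbol{\eta}.
$$

It turns out that the posterior PDF is also Gaussian with mean value equal to

$$
\mathbb{E}[\boldsymbol{\theta}|\boldsymbol{y}] = \boldsymbol{\theta}_0 + \left(\Sigma_0^{-1} + X^T \Sigma_\eta^{-1} X\right)^{-1} X^T \Sigma_\eta^{-1} (\boldsymbol{y} - X\boldsymbol{\theta}_0) \tag{3.73}
$$

and covariance matrix

$$
\Sigma_{\theta|y} = \left(\Sigma_0^{-1} + X^T \Sigma_\eta^{-1} X\right)^{-1}. \tag{3.74}
$$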
3.11.1 THE MAXIMUM A POSTERIORI PROBABILITY ESTIMATION METHOD


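The maximum a posteriori probability estimation technique, usually denoted as MAP, is based on the
Bayesian theorem, but it does not go as far as the Bayesian philosophy allows. The goal becomes that
of obtaining an estimate which maximizes Eq. (3.63); in other words,

$$
\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}}\, p(\boldsymbol{\theta}|\mathcal{X}): \quad \text{MAP estimate,} \tag{3.75}
$$

and because $p(\mathcal{X})$ is independent of $\boldsymbol{\theta}$, this leads to

$$
\begin{aligned}
\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} &= \arg\max_{\boldsymbol{\theta}}\, p(\mathcal{X}|\boldsymbol{\theta})p(\boldsymbol{\theta}) \\
&= \arg\max_{\boldsymbol{\theta}} \left\{\ln p(\mathcal{X}|\boldsymbol{\theta}) + \ln p(\boldsymbol{\theta})\right\}.
\end{aligned}
\tag{3.76}
$$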


If we consider Example 3.8, it is a matter of simple exercise to obtain the MAP estimate and show
that

$$
\hat{\theta}_{\mathrm{MAP}} = \frac{N\bar{y}_N + \dfrac{\sigma_\eta^2}{\sigma_0^2}\,\theta_0}{N + \dfrac{\sigma_\eta^2}{\sigma_0^2}} = \bar{\theta}_N. \tag{3.77}
$$

Note that for this case, the MAP estimate coincides with the regularized LS solution for $\lambda = \sigma_\eta^2/\sigma_0^2$.
Once more, we verify that adopting a prior PDF for the unknown parameter acts as a regularizer which
embeds into the problem the available prior information.
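
The claim for Example 3.8 can be checked numerically by maximizing the objective in Eq. (3.76) directly and comparing with the closed form. The sketch below is one way to do so, under stated assumptions: the data values, the search interval $[-10, 10]$, and the use of a ternary search (valid here because the objective is concave in $\theta$) are all choices made for the illustration.

```python
def log_posterior_unnorm(theta, ys, theta0, sigma0_sq, sigma_eta_sq):
    # ln p(y|theta) + ln p(theta), dropping terms that do not depend on theta.
    log_lik = -sum((y - theta) ** 2 for y in ys) / (2 * sigma_eta_sq)
    log_prior = -(theta - theta0) ** 2 / (2 * sigma0_sq)
    return log_lik + log_prior

def map_closed_form(ys, theta0, sigma0_sq, sigma_eta_sq):
    # Regularized-LS form of the estimate with lambda = sigma_eta^2 / sigma_0^2.
    lam = sigma_eta_sq / sigma0_sq
    return (sum(ys) + lam * theta0) / (len(ys) + lam)  # sum(ys) = N * y_bar

def map_numeric(ys, theta0, sigma0_sq, sigma_eta_sq, lo=-10.0, hi=10.0, iters=200):
    # Ternary search for the maximizer of a concave one-dimensional objective.
    def f(t):
        return log_posterior_unnorm(t, ys, theta0, sigma0_sq, sigma_eta_sq)
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    ys = [0.9, 1.2, 1.0, 0.8, 1.1]  # hypothetical observations
    args = (ys, 2.0, 6.0, 0.1)       # theta_0, sigma_0^2, sigma_eta^2 as in Example 3.8
    print("closed form:", map_closed_form(*args))
    print("numeric    :", map_numeric(*args))
```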

Remarks 3.4.

• Observe that for the case of Example 3.8, all three estimators, namely, ML, MAP, and the Bayesian
(taking the mean), result asymptotically, as $N$ increases, in the same estimate. This is a more general
result and it is true for other PDFs as well as for the case of parameter vectors. As the number of
observations increases, our uncertainty is reduced and $p(\mathcal{X}|\boldsymbol{\theta})$, $p(\boldsymbol{\theta}|\mathcal{X})$ peak sharply around a value
of $\boldsymbol{\theta}$. This forces all the methods to result in similar estimates. However, the obtained estimates
are different for finite values of $N$. More recently, as we will see in Chapters 12 and 13, Bayesian
methods have become very popular, and they seem to be the preferred choice for a number of
practical problems.

• Choosing the prior PDF in the Bayesian methods is not an innocent task. In Example 3.8, we chose
the conditional PDF (likelihood function) as well as the prior PDF to be Gaussians. We saw that the
posterior PDF was also Gaussian. The advantage of such a choice was that we could come to closed
form solutions. This is not always the case, and then the computation of the posterior PDF needs
sampling methods or other approximate techniques. We will come to that in Chapters 12 and 14.
However, the family of Gaussians is not the only one with this nice property of leading to closed
form solutions. In probability theory, if the posterior is of the same form as the prior, we say that
$p(\boldsymbol{\theta})$ is a conjugate prior of the likelihood function $p(\mathcal{X}|\boldsymbol{\theta})$ and then the involved integrations can
be carried out in closed form (see, e.g., [15,30] and Chapter 12). Hence, the Gaussian PDF is a
conjugate of itself.

• Just for the sake of pedagogical purposes, it is useful to recapitulate some of the nice properties that
the Gaussian PDF possesses. We have met the following properties in various sections and problems
in the book so far: (a) it is a conjugate of itself; (b) if two random variables (vectors) are jointly
Gaussian, then their marginal PDFs are also Gaussian and the posterior PDF of one with respect
to the other is also Gaussian; (c) moreover, the linear combination of jointly Gaussian variables
turns out to be Gaussian; (d) as a by-product, it turns out that the sum of statistically independent
Gaussian random variables is also a Gaussian one; and finally (e) the central limit theorem states
that the sum of a large number of independent random variables tends to be Gaussian, as the number
of the summands increases.


3.12 CURSE OF DIMENSIONALITY 

In a number of places in this chapter, we mentioned the need of having a large number of training 
points. In Section 3.9.2, while discussing the bias-variance tradeoff, it was stated that in order to end 
up with a low overall MSE, the complexity (number of parameters) of the model should be small 
enough with respect to the number of training points. In Section 3.8, overfitting was discussed and it 
was pointed out that if the number of training points is small with respect to the number of parameters, 
overfitting occurs. 

The question that is now raised is how big a data set should be, in order to be more relaxed
concerning the performance of the designed predictor. The answer to the previous question depends largely on
the dimensionality of the input space. It turns out that the larger the dimension of the input space is,
the more data points are needed. This is related to the so-called curse of dimensionality, a term coined
for the first time in [4].



FIGURE 3.11

A simple experiment which demonstrates the curse of dimensionality. A number of 100 points are generated
randomly, drawn from a uniform distribution, in order to fill the one-dimensional segment of length equal to one
([1,2] x {1.5}) (red points), and the two-dimensional rectangular region of unit area [1,2] x [2, 3] (gray points).
Observe that although the number of points in both cases is the same, the rectangular region is more sparsely
populated than the densely populated line segment.


Let us assume that we are given the same number of points, $N$, thrown randomly in a unit cube
(hypercube) in two different spaces, one being of low and the other of very high dimension. Then, the
average distance of the points in the latter case will be much larger than that in the low-dimensional
space case. As a matter of fact, the average distance shows a dependence that is analogous to the
exponential term $N^{-1/l}$, where $l$ is the dimensionality of the space [14,37]. For example, the
average distance between two out of $10^{10}$ points in the two-dimensional space is $10^{-5}$, and in the
40-dimensional space it is equal to 1.83. Fig. 3.11 shows two cases, each one consisting of 100 points.
The red points lie on a (one-dimensional) line segment of length equal to one and were generated
according to the uniform distribution. Gray points cover a (two-dimensional) square region of unit area,
which were also generated by a two-dimensional uniform distribution. Observe that the square area is
more sparsely populated compared to the line segment. This is the general trend and high-dimensional
spaces are sparsely populated; thus, many more data points are needed in order to fill in the space with
enough data. Fitting a model in a parameter space, one must have enough data covering sufficiently
well all regions in the space, in order to be able to learn well enough the input-output functional
dependence (Problem 3.13).
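
The sparsity effect can be observed with a small simulation in the spirit of Fig. 3.11. The sketch below draws the same number of uniformly distributed points in unit hypercubes of different dimensions and reports the mean distance from each point to its nearest neighbor. Using the nearest-neighbor distance as the measure of sparsity, as well as the chosen dimensions and the seed, are assumptions of this illustration rather than the exact quantity discussed in the text.

```python
import math
import random

def mean_nn_distance(n_points, dim, seed=0):
    # Mean Euclidean distance from each point to its nearest neighbor,
    # for points drawn uniformly in the unit hypercube [0, 1]^dim.
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n_points

if __name__ == "__main__":
    for dim in (1, 2, 10, 40):
        d = mean_nn_distance(100, dim)
        print(f"dimension {dim:2d}: mean nearest-neighbor distance = {d:.4f}")
```

With the number of points held fixed, the nearest neighbor of a typical point moves farther away as the dimension grows, which is another way of saying that the space becomes more sparsely populated.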

There are various ways to cope with the curse of dimensionality and try to exploit the available 
data set in the best possible way. A popular direction is to resort to suboptimal solutions by projecting
the input/feature vectors in a lower-dimensional subspace or manifold. Very often, such an approach 
leads to small performance losses, because the original training data, although they are generated in 
a high-dimensional space, may in fact “live” in a lower-dimensional subspace or manifold, due to 
physical dependencies that restrict the number of free parameters. Take as an example a case where the 
data are three-dimensional vectors, but they lie around a straight line, which is a one-dimensional linear 
manifold (affine set or subspace if it crosses the origin) or around a circle (one-dimensional nonlinear 
manifold) embedded in the three-dimensional space. In this case, the true number of free parameters is 
equal to one; this is because one free parameter suffices to describe the location of a point on a circle 
or on a straight line. The true number of free parameters is also known as the intrinsic dimensionality 
of the problem. The challenge, now, becomes that of learning the subspace/manifold onto which to 
project. These issues will be considered in more detail in Chapter 19. 
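
As a rough illustration of the straight-line example, the sketch below generates three-dimensional points scattered around a line and inspects the singular values of the centered data. All numerical values (direction, offset, noise level) are assumptions made for the illustration, and the singular value decomposition is used here only as a simple linear diagnostic; it is not the method developed in Chapter 19.

```python
import numpy as np

def singular_value_energy(X):
    """Fraction of the total variance captured along each principal direction."""
    Xc = X - X.mean(axis=0)                   # centering turns the affine set into a subspace
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values in descending order
    return s**2 / np.sum(s**2)

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=500)                    # the single free parameter
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)   # hypothetical unit direction
offset = np.array([0.5, -0.3, 1.0])                     # the line does not cross the origin
X = offset + t[:, None] * direction + 0.01 * rng.standard_normal((500, 3))

energy = singular_value_energy(X)
print(energy)   # the first entry dominates: one direction explains almost all the variance
```

Note that such a linear diagnostic would not reveal the one-dimensional nature of the circle example: points on a circle span a two-dimensional plane, so two significant directions would show up although the intrinsic dimensionality is still one.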

Finally, it has to be noted that the dimensionality of the input space is not always the crucial issue. 
In pattern recognition, it has been shown that the critical factor is the so-called VC dimension of a 
classifier. In a number of classifiers, such as (generalized) linear classifiers or neural networks (to be 
considered in Chapter 18), the VC dimension is directly related to the dimensionality of the input 
space. However, one can design classifiers, such as the support vector machines (Chapter 11), whose
performance is not directly related to the dimensionality of the input space and which can be efficiently
designed in spaces of very high (or even infinite) dimensionality [37,40].


3.13 VALIDATION 

From previous sections, we already know that what is a “good” estimate according to one set of training
points is not necessarily a good one for other data sets. This is an important aspect in any machine learning
task; the performance of a method may vary with the random choice of the training set. A major
phase, in any machine learning task, is to quantify/predict the performance that the designed (prediction)
model is expected to exhibit in practice.

FIGURE 3.12

The training error tends to zero as the model complexity increases; for complex enough models with a large number
of free parameters, a perfect fit to the training data is possible. However, the test error initially decreases, because
more complex models “learn” the data better, up to a certain point. After that point of complexity, the test error
increases.

It will not come as a surprise that “measuring” the
performance against the training data set would lead to an “optimistic” value of the performance index,
because this is computed on the same set on which the estimate was optimized; this trend has been
known since the early 1930s [22]. For example, if the model is complex enough, with a large number
of free parameters, the training error may even become zero, since a perfect fit to the data can be
achieved. What is more meaningful and fair is to look for the so-called generalization performance of
an estimator, that is, its average performance computed over different data sets which did not participate
in the training (see the last paragraph of Section 3.9.2). The error associated with this average
performance is known as the test error or the generalization error.

Fig. 3.12 shows a typical performance that is expected to result in practice. The error measured on
the (single) training data set is shown together with the (average) test error as the model complexity 
varies. If one tries to fit a complex model, with respect to the size of the available training set, then 
the error measured on the training set will be overoptimistic. On the contrary, the true error, as this is 
represented by the test error, takes large values; in the case where the performance index is the MSE, 
this is mainly contributed by the variance term (Section 3.9.2). On the other hand, if the model is too 
simple the test error will also attain large values; for the MSE case, this time the contribution is mainly 
due to the bias term. The idea is to have a model complexity that corresponds to the minimum of the 
respective curve. As a matter of fact, this is the point that various model selection techniques try to 
predict. 
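
The behavior sketched in Fig. 3.12 can be reproduced with a small experiment. The following sketch is an illustration under stated assumptions rather than the book's own setup: the generating function (a sinusoid), the noise level, the sample sizes, and the polynomial degrees are all hypothetical choices, and a single large held-out set stands in for the average test error.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def train_test_mse(degrees, n_train=20, n_test=1000, noise_std=0.3, seed=0):
    """Fit polynomials of increasing degree by least squares and return
    (degree, training MSE, test MSE) triples."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2.0 * np.pi * x)          # hypothetical "true" function
    x_tr = rng.uniform(0.0, 1.0, n_train)
    y_tr = f(x_tr) + noise_std * rng.standard_normal(n_train)
    x_te = rng.uniform(0.0, 1.0, n_test)
    y_te = f(x_te) + noise_std * rng.standard_normal(n_test)
    results = []
    for d in degrees:
        coeffs = P.polyfit(x_tr, y_tr, d)          # least squares polynomial fit
        mse_tr = float(np.mean((y_tr - P.polyval(x_tr, coeffs)) ** 2))
        mse_te = float(np.mean((y_te - P.polyval(x_te, coeffs)) ** 2))
        results.append((d, mse_tr, mse_te))
    return results

results = train_test_mse((0, 1, 3, 9))
for d, mse_tr, mse_te in results:
    print(f"degree {d}: training MSE {mse_tr:.4f}, test MSE {mse_te:.4f}")
```

Because the polynomial families are nested, the training error cannot increase with the degree; the test error, in contrast, depends on how the model complexity compares with the size of the training set.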

For some simple cases and under certain assumptions concerning the underlying models, we are
able to have analytical formulas that quantify the average performance as we change data sets. However,
in practice, this is hardly ever the case, and one must have a way to test the performance of an
obtained classifier/predictor using different data sets. This process is known as validation, and there
are a number of alternatives that one can resort to.

8 Note that some authors use the term generalization error to denote the difference between the test and the training errors.
Another term for this difference is generalization gap.

Assuming that enough data are at the designer’s disposal, one can split the data into one part, to be
used for training, and another part for testing the performance. For example, the probability of error is
computed over the test data set for the case of a classifier, or the MSE for the case of a regression task;
other measures of fit can also be used. If this path is taken, one has to make sure that both the size of the
training set and the size of the test set are large enough with respect to the model complexity; a large test
data set is required in order to provide a statistically sound result on the test error. Especially if different
methods are compared, the smaller the difference in their comparative performance is expected to be,
the larger the size of the test set must be, in order to guarantee reliable conclusions [37, Chapter 10].

CROSS-VALIDATION 

In practice, very often the size of the available data set is not sufficient and one cannot afford to “lose”
part of it from the training set for the sake of testing. Cross-validation is a very common technique
that is usually employed. Cross-validation has been rediscovered a number of times; however, to our
knowledge, the first published description can be traced back to [25]. According to this method, the
data set is split into $K$ roughly equal-sized parts. We repeat training $K$ times, each time selecting one
(different each time) part of the data for testing and the remaining $K - 1$ parts for training. This gives
us the advantage of testing with a part of the data that has not been involved in the training, so it can
be considered independent, and at the same time using, eventually, all the data, both for training and
testing. Once we finish, we can (a) combine the obtained $K$ estimates by averaging or via another
more advanced way and (b) combine the errors from the test sets to get a better estimate of the test
error that our estimator is expected to exhibit in real-life applications. This method is known as $K$-fold
cross-validation. An extreme case is when we use $K = N$, so that each time one sample is left for
testing. This is sometimes referred to as the leave-one-out (LOO) cross-validation method. The price
one pays for $K$-fold cross-validation is the complexity of training $K$ times. In practice, the value of $K$
depends very much on the application, but typical values are of the order of 5 to 10.
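
The procedure just described can be sketched in a few lines. The code below is a minimal illustration, assuming a linear model fitted by least squares and the squared error as the performance index; the helper names `k_fold_indices` and `cross_val_mse`, as well as the synthetic data, are hypothetical and only serve the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Split a random permutation of the sample indices into K roughly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_mse(X, y, k=5, seed=0):
    """K-fold cross-validation estimate of the test MSE of a least squares fit."""
    folds = k_fold_indices(len(y), k, seed)
    sq_errors = []
    for i in range(k):
        test_idx = folds[i]                                   # part left out for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        sq_errors.append((y[test_idx] - X[test_idx] @ theta) ** 2)
    return float(np.mean(np.concatenate(sq_errors)))          # combine errors from all folds

# Synthetic data (assumed for the illustration): linear model, noise variance 0.25.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + 0.5 * rng.standard_normal(200)

cv_mse_5 = cross_val_mse(X, y, k=5)
cv_mse_loo = cross_val_mse(X, y, k=len(y))    # K = N gives leave-one-out
print(cv_mse_5, cv_mse_loo)
```

Setting `k` equal to the number of samples yields the LOO variant at the cost of N training runs, which makes the computational price mentioned above concrete.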

The cross-validation estimator of the test error is very nearly unbiased. The reason for the slight 
bias is that the training set in cross-validation is slightly smaller than the actual data set. The effect of 
this bias will be conservative in the sense that the estimated fit will be slightly biased in the direction 
suggesting a poorer fit. In practice, this bias is rarely a concern, especially in the LOO case, where 
each time only one sample is left out. The variance, however, of the cross-validation estimator can 
be large, and this has to be taken into account when comparing different methods. In [13], the use of 
bootstrap techniques is suggested in order to reduce the variance of the obtained error predictions by 
the cross-validation method. 

Moreover, besides complexity and high variance, cross-validation schemes are not beyond criticism.
Unfortunately, the overlap among the training sets introduces unknowable dependencies between
runs, making the use of formal statistical tests difficult [11]. All this discussion reveals that the
validation task is far from innocent. Ideally, one should have at one's disposal large data sets and divide
them into several nonoverlapping training sets, of whatever size is appropriate, along with separate
test sets (or a single one) that are (is) large enough. More on different validation schemes and their
properties can be found in, e.g., [3,12,17,37], and an insightful related discussion is provided in [26].

3.14 EXPECTED LOSS AND EMPIRICAL RISK FUNCTIONS

What was said before in our discussion concerning the generalization and the training set-based performance
of an estimator can be given a more formal statement via the notion of expected loss. Sometimes
the expected loss is referred to as the risk function. Adopting a loss function, $\mathcal{L}(\cdot,\cdot)$, in order to quantify
the deviation between the predicted value, $\hat{y} = f(\boldsymbol{x})$, and the respective true one, $y$, the corresponding
expected loss is defined as

$$J(f) := \mathbb{E}\left[\mathcal{L}\big(y, f(\boldsymbol{x})\big)\right], \tag{3.78}$$

or more explicitly

$$J(f) = \int\!\!\int \mathcal{L}\big(y, f(\boldsymbol{x})\big)\, p(y, \boldsymbol{x})\, dy\, d\boldsymbol{x}: \quad \text{expected loss function}, \tag{3.79}$$

where the integration is replaced by summation whenever the respective variables are discrete. As a
matter of fact, this is the ideal cost function one would like to optimize with respect to $f(\cdot)$, in order to
get the optimal estimator over all possible values of the input-output pairs. However, such an optimization
would in general be a very hard task, even if one knew the functional form of the joint distribution.
Thus, in practice, one has to be content with two approximations. First, the functions to be searched are
constrained within a certain family $\mathcal{F}$ (in this chapter, we focused on parametrically described families
of functions). Second, because the joint distribution is either unknown and/or the integration may not
be analytically tractable, the expected loss is approximated by the so-called empirical risk version,
defined as

$$J_N(f) = \frac{1}{N}\sum_{n=1}^{N} \mathcal{L}\big(y_n, f(\boldsymbol{x}_n)\big): \quad \text{empirical risk function}. \tag{3.80}$$
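
To make the relation between the two quantities concrete, the sketch below evaluates the empirical risk of a fixed predictor for increasing N and compares it with the expected loss, which can be computed in closed form for this toy setup. Everything in the example is an assumption made for illustration: the squared-error loss, the generating model y = 2x + noise with x uniform in [0, 1], and the deliberately imperfect predictor f(x) = x, which is chosen independently of the data.

```python
import numpy as np

def empirical_risk(f, x, y):
    """J_N(f): average squared-error loss over the N available pairs."""
    return float(np.mean((y - f(x)) ** 2))

rng = np.random.default_rng(0)
sigma = 0.1                          # assumed noise standard deviation
f = lambda x: x                      # fixed predictor, not fitted to the data

def sample(n):
    """Draw n pairs from the assumed model y = 2x + eta, x ~ U(0, 1)."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + sigma * rng.standard_normal(n)
    return x, y

# For this model y - f(x) = x + eta, so J(f) = E[x^2] + sigma^2 = 1/3 + sigma^2.
expected_loss = 1.0 / 3.0 + sigma**2

for n in (10, 1_000, 100_000):
    x, y = sample(n)
    print(n, empirical_risk(f, x, y), expected_loss)
```

For a predictor that is fixed in advance, the empirical risk approaches the expected loss as N grows. The situation is different when the predictor is itself optimized on the same samples, which is the case discussed next.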


As an example, the MSE function, discussed earlier, is the expected loss associated with the squared
error loss function and the sum of squared errors cost is the respective empirical version. For large
enough values of $N$ and provided that the family of functions is restricted enough, we expect the
outcome from optimizing $J_N$ to be close to that which would be obtained by optimizing $J$ (see, e.g.,
[40]).

9 That is, the family of functions is not very large. To keep the discussion simple, take the example of the quadratic class of
functions. This is larger than that of the linear ones, because the latter is a special case (subset) of the former.

From the validation point of view, given any prediction function $f(\cdot)$, what we called the test error
corresponds to the value of $J$ in Eq. (3.79) and what we called the training error corresponds
to that of $J_N$ in Eq. (3.80).

We can now take the discussion a little further, and we will reveal some more secrets concerning the
accuracy-complexity tradeoff in machine learning. Let $f_*$ be the function that optimizes the expected
loss,

$$f_* := \arg\min_{f} J(f), \tag{3.81}$$

and let $f_{\mathcal{F}}$ be the optimum after constraining the task within the family of functions $\mathcal{F}$,

$$f_{\mathcal{F}} := \arg\min_{f \in \mathcal{F}} J(f). \tag{3.82}$$

Let us also define

$$f_N := \arg\min_{f \in \mathcal{F}} J_N(f). \tag{3.83}$$

Then, we can readily write that

$$\mathbb{E}\left[J(f_N) - J(f_*)\right] = \underbrace{\mathbb{E}\left[J(f_{\mathcal{F}}) - J(f_*)\right]}_{\text{approximation error}} + \underbrace{\mathbb{E}\left[J(f_N) - J(f_{\mathcal{F}})\right]}_{\text{estimation error}}. \tag{3.84}$$


The approximation error measures the deviation in the test error if instead of the overall optimal function
one uses the optimal obtained within a certain family of functions. The estimation error measures
the deviation due to optimizing the empirical risk instead of the expected loss. If one chooses the family
of functions to be very large, then it is expected that the approximation error will be small, because
there is a high probability that $f_*$ will be close to one of the members of the family. However, the estimation
error is expected to be large, because for a fixed number of data points $N$, fitting a complex function
is likely to lead to overfitting. For example, if the family of functions is the class of polynomials of
a very large order, a very large number of parameters are to be estimated and overfitting will occur.
The opposite is true if the class of functions is a small one. In parametric modeling, complexity of a
family of functions is related to the number of free parameters. However, this is not the whole story.
As a matter of fact, complexity is really measured by the so-called capacity of the associated set of
functions. The VC dimension mentioned in Section 3.12 is directly related to the capacity of the family
of the considered classifiers. More concerning the theoretical treatment of these issues can be obtained
from [10,40,41].
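
For completeness, the decomposition in Eq. (3.84) can be traced to a simple add-and-subtract step. The short derivation below is a sketch that uses only the definitions in Eqs. (3.81)-(3.83); the expectation is understood to be taken over the random training sets on which $f_N$ depends.

```latex
\begin{align*}
J(f_N) - J(f_*)
  &= J(f_N) - J(f_{\mathcal{F}}) + J(f_{\mathcal{F}}) - J(f_*) \\
  &= \underbrace{\big[J(f_{\mathcal{F}}) - J(f_*)\big]}_{\ge 0,\ \text{since } f_* \text{ minimizes } J \text{ over all functions}}
   + \underbrace{\big[J(f_N) - J(f_{\mathcal{F}})\big]}_{\ge 0,\ \text{since } f_N \in \mathcal{F} \text{ and } f_{\mathcal{F}} \text{ minimizes } J \text{ over } \mathcal{F}} .
\end{align*}
```

Taking expectations on both sides gives Eq. (3.84). Both terms are nonnegative, so enlarging the family $\mathcal{F}$ can only reduce the first term, while the second term reflects the price of optimizing $J_N$ in place of $J$.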


LEARNABILITY 

In a general setting, an issue of fundamental importance is whether the expected loss in Eq. (3.79) can
be minimized to within an arbitrary precision based on only a finite set of observations, $(y_n, \boldsymbol{x}_n)$,
$n = 1, 2, \ldots, N$, as $N \to \infty$. The issue here is not an algorithmic one, that is, how efficiently this can be
done. The learnability issue, as it is called, refers to whether it is statistically possible to employ the
empirical risk function in Eq. (3.80) in place of the expected loss function.

It has been established that for the supervised classification and regression problems, a task is
learnable if and only if the empirical cost, $J_N(f)$, converges uniformly to the expected loss function
for all $f \in \mathcal{F}$ (see, e.g., [2,7]). However, this is not necessarily the case for other learning tasks. As
a matter of fact, it can be shown that there exist tasks that cannot be learned via the empirical risk
function. Instead, they can be learned via alternative mechanisms (see, e.g., [34]). For such cases, the
notion of the algorithmic stability replaces that of uniform convergence.

3.15 NONPARAMETRIC MODELING AND ESTIMATION

The focus of this chapter was on the task of parameter estimation and on techniques that spring from
the idea of parametric functional modeling of an input-output dependence. However, as said in the
beginning of the chapter, besides parametric modeling, an alternative philosophy that runs across the
field of statistical estimation is that of nonparametric modeling. This alternative path presents itself
with two faces.

In its most classical version, the estimation task involves no parameters. Typical examples of
such methods are the histogram approximation of an unknown distribution, the closely related Parzen
windows method [28], and the k-nearest neighbor density estimation (see, e.g., [37,38]). The latter
approach is related to one of the most widely known and used methods for classification, known as the
k-nearest neighbor classification rule, which is discussed in Chapter 7. For the sake of completeness,
the basic rationale behind such methods is provided in the Additional Material part that is related to
the current chapter and can be downloaded from the site of the book.

The other path of nonparametric modeling is when parameters pop in, yet their number is not
fixed and a priori selected, but it grows with the number of training examples. We will treat such
models in the context of reproducing kernel Hilbert spaces (RKHSs) in Chapter 11. There, instead of
parameterizing the family of functions, in which one constrains the search for finding the prediction
model, the candidate solution is constrained to lie within a specific functional space. Nonparametric
models in the context of Bayesian learning are also treated in Chapter 13.


PROBLEMS 


3.1 Prove the least squares optimal solution for the linear regression case given in Eq. (3.13). 

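3.2 Let $\hat{\boldsymbol{\theta}}_i$, $i = 1, 2, \ldots, m$, be unbiased estimators of a parameter vector $\boldsymbol{\theta}$, so that $\mathbb{E}[\hat{\boldsymbol{\theta}}_i] = \boldsymbol{\theta}$,
$i = 1, \ldots, m$. Moreover, assume that the respective estimators are mutually uncorrelated and
that all have the same (total) variance, $\sigma^2 = \mathbb{E}[(\hat{\boldsymbol{\theta}}_i - \boldsymbol{\theta})^T(\hat{\boldsymbol{\theta}}_i - \boldsymbol{\theta})]$. Show that by averaging the
estimates, e.g.,

$$\hat{\boldsymbol{\theta}} = \frac{1}{m}\sum_{i=1}^{m} \hat{\boldsymbol{\theta}}_i,$$

the new estimator has total variance $\sigma_c^2 := \mathbb{E}[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})] = \frac{1}{m}\sigma^2$.

3.3 Let a random variable x be described by a uniform PDF in the interval $[0, \frac{1}{\theta}]$, $\theta > 0$. Assume
a function $g$, which defines an estimator $\hat{\theta} := g(\mathrm{x})$ of $\theta$. Then, for such an estimator to be
unbiased, the following must hold:

$$\int_0^{1/\theta} g(x)\, dx = 1.$$

However, show that such a function $g$ does not exist.

10 To avoid any confusion, let $g$ be Lebesgue integrable on intervals of $\mathbb{R}$.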
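
3.4 A family $\{p(\mathcal{D}; \boldsymbol{\theta}) : \boldsymbol{\theta} \in \mathcal{A}\}$ is called complete if, for any vector function $\boldsymbol{h}(\mathcal{D})$ such that
$\mathbb{E}_{\mathcal{D}}[\boldsymbol{h}(\mathcal{D})] = \boldsymbol{0}$, $\forall \boldsymbol{\theta}$, we have $\boldsymbol{h} = \boldsymbol{0}$.

Show that if $\{p(\mathcal{D}; \boldsymbol{\theta}) : \boldsymbol{\theta} \in \mathcal{A}\}$ is complete and there exists an MVU estimator, then this estimator
is unique.

3.5 Let $\hat{\theta}_u$ be an unbiased estimator, so that $\mathbb{E}[\hat{\theta}_u] = \theta_o$. Define a biased one by $\hat{\theta}_b = (1 + \alpha)\hat{\theta}_u$. Show
that the range of $\alpha$ where the MSE of $\hat{\theta}_b$ is smaller than that of $\hat{\theta}_u$ is

$$-2 < -\frac{2\,\mathrm{MSE}(\hat{\theta}_u)}{\mathrm{MSE}(\hat{\theta}_u) + \theta_o^2} < \alpha < 0.$$

3.6 Show that for the setting of Problem 3.5, the optimal value of $\alpha$ is equal to

$$\alpha_* = -\frac{1}{1 + \dfrac{\theta_o^2}{\mathrm{var}(\hat{\theta}_u)}},$$

where, of course, the variance of the unbiased estimator is equal to the corresponding MSE.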

3.7 Show that the regularity condition for the Cramér-Rao bound holds true if the order of integration
and differentiation can be interchanged.

3.8 Derive the Cramér-Rao bound for the LS estimator when the training data result from the linear
model

$$y_n = \theta x_n + \eta_n, \quad n = 1, 2, \ldots, N,$$

where $x_n$ are i.i.d. samples of a zero mean random variable with variance $\sigma_x^2$ and $\eta_n$ are i.i.d.
noise samples drawn from a Gaussian with zero mean and variance $\sigma_\eta^2$. Assume also that x
and $\eta$ are independent. Then, show that the LS estimator achieves the Cramér-Rao bound only
asymptotically.

3.9 Let us consider the regression model

$$y_n = \boldsymbol{\theta}^T \boldsymbol{x}_n + \eta_n, \quad n = 1, 2, \ldots, N,$$

where the noise samples $\boldsymbol{\eta} = [\eta_1, \ldots, \eta_N]^T$ come from a zero mean Gaussian random vector,
with covariance matrix $\Sigma_\eta$. If $X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N]^T$ stands for the input matrix and
$\boldsymbol{y} = [y_1, \ldots, y_N]^T$, then show that

$$\hat{\boldsymbol{\theta}} = \left(X^T \Sigma_\eta^{-1} X\right)^{-1} X^T \Sigma_\eta^{-1} \boldsymbol{y}$$

is an efficient estimate.

Note that the previous estimate coincides with the ML one. Moreover, bear in mind that in the
case where $\Sigma_\eta = \sigma^2 I$, the ML estimate becomes equal to the LS one.

3.10 Assume a set of i.i.d. samples $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ of a random variable, with mean $\mu$ and
variance $\sigma^2$. Define also the quantities

$$S_\mu := \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad S_{\sigma^2} := \frac{1}{N}\sum_{n=1}^{N} (x_n - S_\mu)^2, \qquad \bar{S}_{\sigma^2} := \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2.$$

Show that if $\mu$ is considered to be known, a sufficient statistic for $\sigma^2$ is $\bar{S}_{\sigma^2}$. Moreover, in the
case where both $(\mu, \sigma^2)$ are unknown, a sufficient statistic is the pair $(S_\mu, S_{\sigma^2})$.

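3.11 Show that solving the task

$$\text{minimize} \quad L(\boldsymbol{\theta}, \lambda) = \sum_{n=1}^{N}\left(y_n - \theta_0 - \sum_{i=1}^{l} \theta_i x_{ni}\right)^2 + \lambda \sum_{i=1}^{l} |\theta_i|^2$$

is equivalent to minimizing

$$\text{minimize} \quad L(\boldsymbol{\theta}, \lambda) = \sum_{n=1}^{N}\left((y_n - \bar{y}) - \sum_{i=1}^{l} \theta_i (x_{ni} - \bar{x}_i)\right)^2 + \lambda \sum_{i=1}^{l} |\theta_i|^2,$$

and the estimate of $\theta_0$ is given by

$$\hat{\theta}_0 = \bar{y} - \sum_{i=1}^{l} \hat{\theta}_i \bar{x}_i.$$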


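3.12 This problem refers to Example 3.4, where a linear regression task with a real-valued unknown
parameter $\theta_o$ is considered. Show that $\mathrm{MSE}(\hat{\theta}_b(\lambda)) < \mathrm{MSE}(\hat{\theta}_{\mathrm{MVU}})$, that is, the ridge regression
estimate shows a lower MSE performance than the one for the MVU estimate, if

$$\lambda \in (0, \infty), \quad \theta_o^2 \le \frac{\sigma_\eta^2}{N},$$

$$\lambda \in \left(0, \frac{2\sigma_\eta^2}{\theta_o^2 - \dfrac{\sigma_\eta^2}{N}}\right), \quad \theta_o^2 > \frac{\sigma_\eta^2}{N}.$$

Moreover, the minimum MSE performance for the ridge regression estimate is attained at
$\lambda_* = \sigma_\eta^2 / \theta_o^2$.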

3.13 Consider, once more, the same regression model as that of Problem 3.9, but with $\Sigma_\eta = I_N$.
Compute the MSE of the predictions $\mathbb{E}[(y - \hat{y})^2]$, where $y$ is the true response and $\hat{y}$ is the
predicted value, given a test point $\boldsymbol{x}$ and using the LS estimator

$$\hat{\boldsymbol{\theta}} = \left(X^T X\right)^{-1} X^T \boldsymbol{y}.$$

The LS estimator has been obtained via a set of $N$ measurements, collected in the input matrix
$X$ and $\boldsymbol{y}$, where the notation has been introduced previously in this chapter. The expectation $\mathbb{E}[\cdot]$
is taken with respect to $y$, the training data $\mathcal{D}$, and the test points $\boldsymbol{x}$. Observe the dependence of
the MSE on the dimensionality of the space.

Hint: Consider, first, the MSE, given the value of a test point $\boldsymbol{x}$, and then take the average over
all the test points.

3.14 Assume that the model that generates the data is

$$y_n = A \sin\left(\frac{2\pi}{N} k n + \phi\right) + \eta_n,$$

where $A > 0$ and $k \in \{1, 2, \ldots, N - 1\}$. Assume that $\eta_n$ are i.i.d. samples from a Gaussian noise
of variance $\sigma_\eta^2$. Show that there is no unbiased estimator for the phase $\phi$ based on $N$ measurement
points, $y_n$, $n = 0, 1, \ldots, N - 1$, that attains the Cramér-Rao bound.

3.15 Show that if (y, x) are two jointly distributed random vectors, with values in $\mathbb{R}^k \times \mathbb{R}^l$, then the
MSE optimal estimator of y given the value $\mathbf{x} = \boldsymbol{x}$ is the regression of y conditioned on x, or
$\mathbb{E}[\mathbf{y}|\boldsymbol{x}]$.

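3.16 Assume that x, y are jointly Gaussian random vectors, with covariance matrix

$$\Sigma := \mathbb{E}\left[\begin{bmatrix} \mathbf{x} - \boldsymbol{\mu}_x \\ \mathbf{y} - \boldsymbol{\mu}_y \end{bmatrix}\left[(\mathbf{x} - \boldsymbol{\mu}_x)^T, (\mathbf{y} - \boldsymbol{\mu}_y)^T\right]\right] = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix}.$$

Assuming also that the matrices $\Sigma_x$ and $\bar{\Sigma} := \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}$ are nonsingular, show that the
optimal MSE estimator $\mathbb{E}[\mathbf{y}|\boldsymbol{x}]$ takes the following form:

$$\mathbb{E}[\mathbf{y}|\boldsymbol{x}] = \mathbb{E}[\mathbf{y}] + \Sigma_{yx}\Sigma_x^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_x).$$

Note that $\mathbb{E}[\mathbf{y}|\boldsymbol{x}]$ is an affine function of $\boldsymbol{x}$. In other words, for the case where x and y are jointly
Gaussian, the optimal estimator of y, in the MSE sense, which is in general a nonlinear function,
becomes an affine function of $\boldsymbol{x}$.

In the special case where x, y are scalar random variables, we have

$$\mathbb{E}[\mathrm{y}|x] = \mu_y + \frac{\alpha \sigma_y}{\sigma_x}(x - \mu_x),$$

where $\alpha$ stands for the correlation coefficient, defined as

$$\alpha := \frac{\mathbb{E}\left[(\mathrm{x} - \mu_x)(\mathrm{y} - \mu_y)\right]}{\sigma_x \sigma_y},$$

with $|\alpha| < 1$. Note, also, that the previous assumption on the nonsingularity of $\Sigma_x$ and $\bar{\Sigma}$ translates,
in this special case, to $\sigma_x \neq 0 \neq \sigma_y$.

Hint: Use the matrix inversion lemma from Appendix A, in terms of the Schur complement $\bar{\Sigma}$
of $\Sigma_x$ in $\Sigma$ and the fact that $\det(\Sigma) = \det(\Sigma_x)\det(\bar{\Sigma})$.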

3.17 Assume a number $l$ of jointly Gaussian random variables $\{x_1, x_2, \ldots, x_l\}$ and a nonsingular matrix
$A \in \mathbb{R}^{l \times l}$. If $\mathbf{x} := [x_1, x_2, \ldots, x_l]^T$, then show that the components of the vector y, obtained
by $\mathbf{y} = A\mathbf{x}$, are also jointly Gaussian random variables.

A direct consequence of this result is that any linear combination of jointly Gaussian variables
is also Gaussian.

3.18 Let x be a vector of jointly Gaussian random variables of covariance matrix $\Sigma_x$. Consider the
general linear regression model

$$\mathbf{y} = \Theta \mathbf{x} + \boldsymbol{\eta},$$

where $\Theta \in \mathbb{R}^{k \times l}$ is a parameter matrix and $\boldsymbol{\eta}$ is the noise vector, which is considered to be Gaussian,
with zero mean and covariance matrix $\Sigma_\eta$, independent of x. Then show that y and x
are jointly Gaussian, with the covariance matrix of the stacked vector $[\mathbf{y}^T, \mathbf{x}^T]^T$ given by

$$\Sigma = \begin{bmatrix} \Theta \Sigma_x \Theta^T + \Sigma_\eta & \Theta \Sigma_x \\ \Sigma_x \Theta^T & \Sigma_x \end{bmatrix}.$$

3.19 Show that a linear combination of Gaussian independent variables is also Gaussian. 

3.20 Show that if a sufficient statistic $T(\mathcal{X})$ for a parameter estimation problem exists, then $T(\mathcal{X})$
suffices to express the respective ML estimate.

3.21 Show that if an efficient estimator exists, then it is also optimal in the ML sense. 

3.22 Let the observations resulting from an experiment be $x_n$, $n = 1, 2, \ldots, N$. Assume that they are
independent and that they originate from a Gaussian PDF $\mathcal{N}(\mu, \sigma^2)$. Both the mean and the
variance are unknown. Prove that the ML estimates of these quantities are given by

$$\hat{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu}_{\mathrm{ML}})^2.$$


3.23 Let the observations $x_n$, $n = 1, 2, \ldots, N$, come from the uniform distribution

$$p(x; \theta) = \begin{cases} \dfrac{1}{\theta}, & 0 \le x \le \theta, \\ 0, & \text{otherwise}. \end{cases}$$

Obtain the ML estimate of $\theta$.

3.24 Obtain the ML estimate of the parameter $\lambda > 0$ of the exponential distribution

$$p(x) = \begin{cases} \lambda \exp(-\lambda x), & x \ge 0, \\ 0, & x < 0, \end{cases}$$

based on a set of measurements $x_n$, $n = 1, 2, \ldots, N$.

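3.25 Assume that $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and that a stochastic process $\{x_n\}_{n=-\infty}^{\infty}$ consists of i.i.d. random variables,
such that $p(x_n|\mu) = \mathcal{N}(\mu, \sigma^2)$. Consider $N$ observations so that $\mathcal{X} := \{x_1, x_2, \ldots, x_N\}$,
and prove that the posterior $p(x|\mathcal{X})$, of any $x = x_{n_0}$ conditioned on $\mathcal{X}$, turns out to be Gaussian
with mean $\mu_N$ and variance $\sigma_N^2$, where

$$\mu_N := \frac{N\sigma_0^2 \bar{x} + \sigma^2 \mu_0}{N\sigma_0^2 + \sigma^2}, \qquad \sigma_N^2 := \frac{\sigma^2 \sigma_0^2}{N\sigma_0^2 + \sigma^2},$$

with $\bar{x}$ denoting the sample mean of the observations.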

3.26 Show that for the linear regression model,

$$\boldsymbol{y} = X\boldsymbol{\theta} + \boldsymbol{\eta},$$

the a posteriori probability $p(\boldsymbol{\theta}|\boldsymbol{y})$ is a Gaussian one if the prior distribution probability is
given by $p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}_0, \Sigma_0)$, and the noise samples follow the multivariate Gaussian distribution
$p(\boldsymbol{\eta}) = \mathcal{N}(\boldsymbol{0}, \Sigma_\eta)$. Compute the mean vector and the covariance matrix of the posterior distribution.

3.27 Assume that $x_n$, $n = 1, 2, \ldots, N$, are i.i.d. observations from a Gaussian $\mathcal{N}(\mu, \sigma^2)$. Obtain the
MAP estimate of $\mu$ if the prior follows the exponential distribution

$$p(\mu) = \lambda \exp(-\lambda \mu), \quad \lambda > 0, \ \mu \ge 0.$$


MATLAB® EXERCISES 

3.28 Write a MATLAB program to reproduce the results and figures of Example 3.1. Play with the 
value of the noise variance. 

3.29 Write a MATLAB program to reproduce the results of Example 3.6. Play with the number of 
training points, the degrees of the involved polynomials, and the noise variance in the regression 
model. 

4.1 INTRODUCTION

Mean-square error (MSE) linear estimation is a topic of fundamental importance for parameter estimation
in statistical learning. Besides historical reasons, which take us back to the pioneering works
of Kolmogorov, Wiener, and Kalman, who laid the foundations of the optimal estimation field, understanding
MSE estimation is a must, prior to studying more recent techniques. One always has to grasp
the basics and learn the classics prior to getting involved with new “adventures.” Many of the concepts
to be discussed in this chapter are also used in the next chapters.

Optimizing via a loss function that builds around the square of the error has a number of advantages,
such as a single optimal value, which can be obtained via the solution of a linear set of equations; this
is a very attractive feature in practice. Moreover, due to the relative simplicity of the resulting equations,
the newcomer in the field can get a better feeling of the various notions associated with optimal
parameter estimation. The elegant geometric interpretation of the MSE solution, via the orthogonality
theorem, is presented and discussed. In the chapter, emphasis is also given to computational complexity
issues while solving for the optimal solution. The essence behind these techniques remains exactly
the same as that inspiring a number of computationally efficient schemes for online learning, to be
discussed later in this book.

The development of the chapter is around real-valued variables, something that will be true for
most of the book. However, complex-valued signals are particularly useful in a number of areas, with
communications being a typical example, and the generalization from the real to the complex domain
may not always be trivial. Although in most of the cases the difference lies in replacing matrix transpositions
by Hermitian ones, this is not the whole story. This is the reason that we chose to deal with
complex-valued data in separate sections, whenever the differences from the real data are not trivial
and some subtle issues are involved.


4.2 MEAN-SQUARE ERROR LINEAR ESTIMATION: 
THE NORMAL EQUATIONS 


The general estimation task has been introduced in Chapter 3. There, it was stated that given two
dependent random vectors, y and x, the goal of the estimation task is to obtain a function, $g$, so that,
given a value $\boldsymbol{x}$ of x, one is able to predict (estimate), in some optimal sense, the corresponding value
of y, that is, $\hat{\boldsymbol{y}} = g(\boldsymbol{x})$. The MSE estimation was also presented in Chapter 3 and it was shown that the
optimal MSE estimate of y given the value $\mathbf{x} = \boldsymbol{x}$ is

$$\hat{\boldsymbol{y}} = \mathbb{E}[\mathbf{y}|\boldsymbol{x}].$$


In general, this is a nonlinear function. We now turn our attention to the case where g is constrained 
to be a linear function. For simplicity and in order to pay more attention to the concepts, we will 
restrict our discussion to the case of scalar dependent (output) variables. The more general case will be 
discussed later on. 


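Let $(\mathrm{y}, \mathbf{x}) \in \mathbb{R} \times \mathbb{R}^l$ be two jointly distributed random entities of zero mean values. In case the
mean values are not zero, they are subtracted. Our goal is to obtain an estimate of $\boldsymbol{\theta} \in \mathbb{R}^l$ in the linear
estimator model,

$$\hat{\mathrm{y}} = \boldsymbol{\theta}^T \mathbf{x}, \tag{4.1}$$

so that

$$J(\boldsymbol{\theta}) := \mathbb{E}\left[(\mathrm{y} - \hat{\mathrm{y}})^2\right] \tag{4.2}$$

is minimum, or

$$\boldsymbol{\theta}_* := \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}). \tag{4.3}$$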
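
In other words, the optimal estimator is chosen so as to minimize the variance of the error random
variable

$$\mathrm{e} = \mathrm{y} - \hat{\mathrm{y}}. \tag{4.4}$$

Minimizing the cost function $J(\boldsymbol{\theta})$ is equivalent to setting its gradient with respect to $\boldsymbol{\theta}$ equal to zero
(see Appendix A),

$$\begin{aligned}
\nabla J(\boldsymbol{\theta}) &= \nabla\, \mathbb{E}\left[(\mathrm{y} - \boldsymbol{\theta}^T \mathbf{x})(\mathrm{y} - \mathbf{x}^T \boldsymbol{\theta})\right] \\
&= \nabla \left\{\mathbb{E}[\mathrm{y}^2] - 2\boldsymbol{\theta}^T \mathbb{E}[\mathbf{x}\mathrm{y}] + \boldsymbol{\theta}^T \mathbb{E}[\mathbf{x}\mathbf{x}^T]\boldsymbol{\theta}\right\} \\
&= -2\boldsymbol{p} + 2\Sigma_x \boldsymbol{\theta} = \boldsymbol{0},
\end{aligned}$$

or

$$\Sigma_x \boldsymbol{\theta}_* = \boldsymbol{p}: \quad \text{normal equations}, \tag{4.5}$$

where the input-output cross-correlation vector $\boldsymbol{p}$ is given by

$$\boldsymbol{p} = \left[\mathbb{E}[\mathrm{x}_1 \mathrm{y}], \ldots, \mathbb{E}[\mathrm{x}_l \mathrm{y}]\right]^T = \mathbb{E}[\mathbf{x}\mathrm{y}], \tag{4.6}$$

and the respective covariance matrix is given by

$$\Sigma_x = \mathbb{E}\left[\mathbf{x}\mathbf{x}^T\right].$$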

Thus, the weights of the optimal linear estimator are obtained via a linear system of equations,
provided that the covariance matrix is positive definite and hence it can be inverted (Appendix A).
Moreover, in this case, the solution is unique. On the contrary, if $\Sigma_x$ is singular and hence cannot be
inverted, there are infinitely many solutions (Problem 4.1).
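
As a numerical illustration of the normal equations, the sketch below replaces the expectations in $\Sigma_x$ and $\boldsymbol{p}$ with sample averages and solves the resulting linear system. The generating parameters, the mixing matrix, and the noise level are assumptions introduced for the example; with sample averages in place of expectations, the solution coincides with the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 50_000, 3
theta_true = np.array([1.0, -2.0, 0.5])      # hypothetical generating parameters
A = np.array([[1.0, 0.5, 0.0],               # fixed mixing matrix, so that the
              [0.0, 1.0, 0.5],               # zero-mean inputs are correlated
              [0.0, 0.0, 1.0]])
X = rng.standard_normal((N, l)) @ A.T        # row n holds the input vector x_n^T
y = X @ theta_true + 0.1 * rng.standard_normal(N)

Sigma_x = X.T @ X / N                        # sample estimate of E[x x^T]
p = X.T @ y / N                              # sample estimate of E[x y]

# Solve Sigma_x theta = p directly rather than forming the inverse explicitly.
theta_star = np.linalg.solve(Sigma_x, p)
print(theta_star)
```

Solving the linear system with `np.linalg.solve` instead of inverting $\Sigma_x$ is a common numerical choice; it requires, as stated above, that the (estimated) covariance matrix is nonsingular.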


4.2.1 THE COST FUNCTION SURFACE 

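Elaborating on the cost function $J(\boldsymbol{\theta})$, as defined in (4.2), we get

$$J(\boldsymbol{\theta}) = \sigma_y^2 - 2\boldsymbol{\theta}^T \boldsymbol{p} + \boldsymbol{\theta}^T \Sigma_x \boldsymbol{\theta}. \tag{4.7}$$

Adding and subtracting the term $\boldsymbol{\theta}_*^T \Sigma_x \boldsymbol{\theta}_*$ and taking into account the definition of $\boldsymbol{\theta}_*$ from (4.5), it is
readily seen that

$$J(\boldsymbol{\theta}) = J(\boldsymbol{\theta}_*) + (\boldsymbol{\theta} - \boldsymbol{\theta}_*)^T \Sigma_x (\boldsymbol{\theta} - \boldsymbol{\theta}_*), \tag{4.8}$$

where

$$J(\boldsymbol{\theta}_*) = \sigma_y^2 - \boldsymbol{p}^T \Sigma_x^{-1} \boldsymbol{p} = \sigma_y^2 - \boldsymbol{\theta}_*^T \Sigma_x \boldsymbol{\theta}_* = \sigma_y^2 - \boldsymbol{p}^T \boldsymbol{\theta}_* \tag{4.9}$$

is the minimum achieved at the optimal solution. From (4.8) and (4.9), the following remarks can be
made.

1 The cross-correlation vector is often denoted as $\boldsymbol{r}_{xy}$. Here we will use $\boldsymbol{p}$, in order to simplify the notation.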

Remarks 4.1.

• The cost at the optimal value $\boldsymbol{\theta}_*$ is always less than the variance $\mathbb{E}[\mathrm{y}^2]$ of the output variable. This
is guaranteed by the positive definite nature of $\Sigma_x$ or $\Sigma_x^{-1}$, which makes the second term on the
right-hand side in (4.9) always positive, unless $\boldsymbol{p} = \boldsymbol{0}$; however, the cross-correlation vector will
only be zero if x and y are uncorrelated. In this case, one cannot say anything (make any
prediction) about y by observing samples of x, at least as far as the MSE criterion is concerned,
which turns out to involve information residing up to the second-order statistics. In this case, the
variance of the error, which coincides with $J(\boldsymbol{\theta}_*)$, will be equal to the variance $\sigma_y^2$; the latter is
a measure of the “intrinsic” uncertainty of y around its (zero) mean value. On the contrary, if the
input-output variables are correlated, then observing x removes part of the uncertainty associated
with y.

• For any value $\boldsymbol{\theta}$ other than the optimal $\boldsymbol{\theta}_*$, the error variance increases as (4.8) suggests, due to
the positive definite nature of $\Sigma_x$. Fig. 4.1 shows the cost function (MSE) surface defined by $J(\boldsymbol{\theta})$
in (4.8). The corresponding isovalue contours are shown in Fig. 4.2. In general, they are ellipses,
whose axes are determined by the eigenstructure of $\Sigma_x$. For $\Sigma_x = \sigma^2 I$, where all eigenvalues are
equal to $\sigma^2$, the contours are circles (Problem 4.3).
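
The two remarks can be verified numerically for a small example. In the sketch below, the covariance matrix, the cross-correlation vector, and the output variance are hypothetical values chosen only so that the cost is well defined; the code checks Eqs. (4.8) and (4.9) at an arbitrary point and computes the eigenstructure that shapes the isovalue contours.

```python
import numpy as np

Sigma_x = np.array([[2.0, 0.8],
                    [0.8, 1.0]])      # hypothetical positive definite covariance matrix
p = np.array([1.0, 0.5])              # hypothetical cross-correlation vector
sigma_y2 = 2.0                        # hypothetical output variance

def J(theta):
    """Cost function of Eq. (4.7)."""
    return sigma_y2 - 2.0 * theta @ p + theta @ Sigma_x @ theta

theta_star = np.linalg.solve(Sigma_x, p)     # normal equations, Eq. (4.5)
J_min = sigma_y2 - p @ theta_star            # minimum cost, Eq. (4.9)

theta = np.array([0.3, -1.2])                # any other parameter value
lhs = J(theta)
rhs = J_min + (theta - theta_star) @ Sigma_x @ (theta - theta_star)   # Eq. (4.8)
print(lhs, rhs, J_min)

# Eigenstructure of Sigma_x: the contour J(theta) = J_min + c is an ellipse whose
# semi-axis along the i-th eigenvector has length sqrt(c / eigenvalue_i).
eigvals, eigvecs = np.linalg.eigh(Sigma_x)   # eigenvalues in ascending order
semi_axes = np.sqrt(1.0 / eigvals)           # semi-axis lengths for c = 1
print(eigvals, semi_axes)
```

Since the semi-axis lengths scale as the inverse square root of the eigenvalues, the major axis of each ellipse lies along the eigenvector associated with the smallest eigenvalue, and equal eigenvalues give circles.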


4.3 A GEOMETRIC VIEWPOINT: ORTHOGONALITY CONDITION

FIGURE 4.1

The MSE cost function has the form of a (hyper)paraboloid.

FIGURE 4.2

The isovalue contours for the cost function surface corresponding to Fig. 4.1. They are ellipses; the major and the minor axes of each ellipse are determined by the maximum and minimum eigenvalues, $\lambda_{\max}$ and $\lambda_{\min}$, of the covariance matrix, $\Sigma$, of the input random variables. The larger the ratio $\lambda_{\max}/\lambda_{\min}$ is, the more elongated the ellipses are. The ellipses become circles if the covariance matrix has the special form $\sigma^2 I$, that is, if all variables are mutually uncorrelated and have the same variance. By varying $\Sigma$, different shapes and orientations of the ellipses result.

2 These operations also satisfy all the properties required for a set to be a linear space, including associativity, commutativity, and so on (see [47] and the appendix associated with Chapter 8).

A very intuitive view of what we have said so far comes from the geometric interpretation of the random variables. The reader can easily check that the set of random variables is a linear space over the field of real (and complex) numbers. Indeed, if $x$ and $y$ are any two random variables, then $x + y$, as well as $\alpha x$, are also random variables for every $\alpha \in \mathbb{R}$. We can now equip this linear space with an inner product operation, which also implies a norm and makes it an inner product space. The reader can easily check that the mean value operation has all the properties required for an operation to be called an inner product. Indeed, for any subset of random variables,

• $E[xy] = E[yx]$,

• $E[(\alpha_1 x_1 + \alpha_2 x_2)y] = \alpha_1 E[x_1 y] + \alpha_2 E[x_2 y]$,

• $E[x^2] \ge 0$, with equality if and only if $x = 0$.

Thus, the norm induced by this inner product,

$$\|x\| := \sqrt{E[x^2]},$$

coincides with the respective standard deviation (assuming $E[x] = 0$). From now on, given two uncorrelated random variables $x$, $y$, that is, $E[xy] = 0$, we can call them orthogonal, because their inner product is zero. We are now free to apply to our task of interest the orthogonality theorem, which is known to us from our familiar finite-dimensional (Euclidean) linear (vector) spaces.

Let us now rewrite (4.1) as

$$\hat{y} = \theta_1 x_1 + \cdots + \theta_l x_l.$$

Thus, the random variable, $\hat{y}$, which is now interpreted as a point in a vector space, results as a linear combination of $l$ elements in this space. Hence, the variable $\hat{y}$ will necessarily lie in the subspace spanned by these points. In contrast, the true variable, $y$, will not lie, in general, in this subspace. Because our goal is to obtain a $\hat{y}$ that is a good approximation of $y$, we have to seek the specific linear combination that makes the norm of the error, $e = y - \hat{y}$, minimum. This specific linear combination corresponds to the orthogonal projection of $y$ onto the subspace spanned by the points $x_1, x_2, \ldots, x_l$. This is equivalent to requiring

$$E[e x_k] = 0, \quad k = 1, 2, \ldots, l: \quad \text{orthogonality condition}. \tag{4.10}$$

FIGURE 4.3

Projecting $y$ on the subspace spanned by $x_1$, $x_2$ (shaded plane) guarantees that the deviation between $y$ and $\hat{y}$ corresponds to the minimum MSE.


The error variable being orthogonal to every point $x_k$, $k = 1, 2, \ldots, l$, will be orthogonal to the respective subspace. This is illustrated in Fig. 4.3. Such a choice guarantees that the resulting error will have the minimum norm; by the definition of the norm, this corresponds to the minimum MSE, i.e., to the minimum $E[e^2]$.

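The set of equations in (4.10) can now be written as

$$E\Big[\Big(y - \sum_{i=1}^{l}\theta_i x_i\Big)x_k\Big] = 0, \quad k = 1, 2, \ldots, l,$$

or

$$\sum_{i=1}^{l} E[x_i x_k]\,\theta_i = E[x_k y], \quad k = 1, 2, \ldots, l, \tag{4.11}$$

which leads to the linear set of equations in (4.5).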

This is the reason that this elegant set of equations is known as normal equations. Another 
name is Wiener-Hopf equations. Strictly speaking, the Wiener-Hopf equations were first derived 
for continuous-time processes in the context of the causal estimation task [49,50]; for a discussion see 
[16,44]. 
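The orthogonality condition (4.10) can be illustrated with sample averages in place of expectations. This is only a sketch under assumed data: the mixing matrix `M`, the cubic term in `y`, and the sample size are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, l = 50_000, 3

# Correlated inputs (rows of X are samples of x); the mixing matrix is arbitrary.
M = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = rng.standard_normal((N, l)) @ M
# Output that is deliberately not exactly linear in x.
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * X[:, 0] ** 3 + rng.standard_normal(N)

Sigma_hat = X.T @ X / N                 # sample estimate of E[x x^T]
p_hat = X.T @ y / N                     # sample estimate of E[x y]
theta = np.linalg.solve(Sigma_hat, p_hat)   # sample normal equations

e = y - X @ theta                       # error e = y - y_hat
inner = X.T @ e / N                     # sample inner products E[e x_k]
mse_opt = np.mean(e ** 2)
mse_other = np.mean((y - X @ (theta + 0.1)) ** 2)

print(inner)                            # numerically zero for every k
print(mse_opt, mse_other)               # the perturbed theta gives a larger MSE
```

Even though the true relation is not linear, the error of the best linear estimator is orthogonal to every input, which is all that the normal equations require.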

Norbert Wiener was a mathematician and philosopher. He was awarded a PhD at Harvard at the age of 17 in mathematical logic. During the Second World War, he laid the foundations of linear estimation theory in a classified work, independently of Kolmogorov. Later on, Wiener was involved in pioneering work embracing automation, artificial intelligence, and cognitive science. Being a pacifist, he was regarded with suspicion during the Cold War years.

The other pillar on which linear estimation theory is based is the pioneering work of Andrey Nikolaevich Kolmogorov (1903-1987) [24], who developed his theory independently of Wiener. Kolmogorov’s contributions cover a wide range of topics in mathematics, including probability, computational complexity, and topology. He is the father of the modern axiomatic foundation of the notion of probability (see Chapter 2).

Remarks 4.2.

• So far, in our theoretical findings, we have assumed that $\mathbf{x}$ and $y$ are jointly distributed (correlated) variables. If, in addition, we assume that they are linearly related according to the linear regression model,

$$y = \boldsymbol{\theta}_o^T\mathbf{x} + \eta, \quad \boldsymbol{\theta}_o \in \mathbb{R}^k, \tag{4.12}$$

where $\eta$ is a zero mean noise variable independent of $\mathbf{x}$, then, if the dimension $k$ of the true system $\boldsymbol{\theta}_o$ is equal to the number of parameters, $l$, adopted for the model, so that $k = l$, it turns out that (Problem 4.4)

$$\boldsymbol{\theta}_* = \boldsymbol{\theta}_o,$$

and the optimal MSE is equal to the variance of the noise, $\sigma_\eta^2$.

• Undermodeling. If $k > l$, then the order of the model is less than that of the true system, which relates $y$ and $\mathbf{x}$ in (4.12); this is known as undermodeling. It is easy to show that if the variables comprising $\mathbf{x}$ are uncorrelated, then (Problem 4.5)

$$\boldsymbol{\theta}_* = \boldsymbol{\theta}_o^1,$$

where

$$\boldsymbol{\theta}_o := \begin{bmatrix}\boldsymbol{\theta}_o^1 \\ \boldsymbol{\theta}_o^2\end{bmatrix}, \quad \boldsymbol{\theta}_o^1 \in \mathbb{R}^l, \ \boldsymbol{\theta}_o^2 \in \mathbb{R}^{k-l}.$$

In other words, the MSE optimal estimator identifies the first $l$ components of $\boldsymbol{\theta}_o$.
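The undermodeling remark can be sketched directly at the level of second-order statistics. The true system `theta_o` and the two covariance matrices below are hypothetical; the example only shows that the first $l$ components are recovered when the inputs are uncorrelated, and that this need not hold otherwise.

```python
import numpy as np

k, l = 5, 3
theta_o = np.array([1.0, -0.5, 2.0, 0.7, -1.2])   # hypothetical true system


def undermodeled_solution(Sigma_full):
    """Solve the normal equations when only the first l inputs are used."""
    Sigma_l = Sigma_full[:l, :l]          # covariance of the l inputs in the model
    p_l = Sigma_full[:l, :] @ theta_o     # E[x_{1:l} y]; the noise is independent of x
    return np.linalg.solve(Sigma_l, p_l)


# Uncorrelated inputs: diagonal covariance matrix.
Sigma_uncorr = np.diag([1.0, 2.0, 0.5, 1.5, 3.0])
theta_uncorr = undermodeled_solution(Sigma_uncorr)

# Correlated inputs: covariance with entries 0.8^{|i-j|}.
idx = np.arange(k)
Sigma_corr = 0.8 ** np.abs(idx[:, None] - idx[None, :])
theta_corr = undermodeled_solution(Sigma_corr)

print(theta_uncorr)   # matches the first l entries of theta_o
print(theta_corr)     # biased by the inputs that the model leaves out
```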


4.4 EXTENSION TO COMPLEX-VALUED VARIABLES 

Everything that has been said so far can be extended to complex-valued signals. However, there are a few subtle points involved and this is the reason that we chose to treat this case separately. Complex-valued variables are very common in a number of applications, as for example in communications, e.g., [41].

Given two real-valued variables, $(x, y)$, one can either consider them as a vector quantity in the two-dimensional space, $[x, y]^T$, or describe them as a complex variable, $z = x + jy$, where $j^2 = -1$. Adopting the latter approach offers the luxury of exploiting the operations available in the field $\mathbb{C}$ of complex numbers, i.e., multiplication and division. The existence of such operations greatly facilitates the algebraic manipulations. Recall that such operations are not defined in vector spaces.

3 Multiplication and division can also be defined for groups of four variables, $(x, \phi, z, y)$, known as quaternions; the related algebra was introduced by Hamilton in 1843. The real and complex numbers as well as quaternions are all special cases of the so-called Clifford algebras [39].

Let us assume that we are given a complex-valued (output) random variable

$$y := y_r + jy_i \tag{4.13}$$

and a complex-valued (input) random vector

$$\mathbf{x} = \mathbf{x}_r + j\mathbf{x}_i. \tag{4.14}$$

The quantities $y_r$, $y_i$, $\mathbf{x}_r$, and $\mathbf{x}_i$ are real-valued random variables/vectors. The goal is to compute a linear estimator defined by a complex-valued parameter vector $\boldsymbol{\theta} = \boldsymbol{\theta}_r + j\boldsymbol{\theta}_i \in \mathbb{C}^l$, so as to minimize the respective MSE,

$$E[|e|^2] := E[ee^*] = E[|y - \boldsymbol{\theta}^H\mathbf{x}|^2]. \tag{4.15}$$

Looking at (4.15), it is readily observed that in the case of complex variables the inner product operation between two complex-valued random variables should be defined as $E[xy^*]$, so as to guarantee that the implied norm by the inner product, $\|x\| = \sqrt{E[xx^*]}$, is a valid quantity. Applying the orthogonality condition as before, we rederive the normal equations as in (4.11),

$$\Sigma_x\boldsymbol{\theta}_* = \mathbf{p}, \tag{4.16}$$

where now the covariance matrix and cross-correlation vector are given by

$$\Sigma_x = E[\mathbf{x}\mathbf{x}^H], \tag{4.17}$$

$$\mathbf{p} = E[\mathbf{x}y^*]. \tag{4.18}$$

Note that (4.16)-(4.18) can alternatively be obtained by minimizing (4.15) (Problem 4.6). Moreover, the counterpart of (4.9) is given by

$$J(\boldsymbol{\theta}_*) = \sigma_y^2 - \mathbf{p}^H\Sigma_x^{-1}\mathbf{p} = \sigma_y^2 - \mathbf{p}^H\boldsymbol{\theta}_*. \tag{4.19}$$

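Using the definitions in (4.13) and (4.14), the cost in (4.15) is written as

$$J(\boldsymbol{\theta}) = E[|e|^2] = E[|y - \hat{y}|^2] = E[|y_r - \hat{y}_r|^2] + E[|y_i - \hat{y}_i|^2], \tag{4.20}$$

where

$$\hat{y} := \hat{y}_r + j\hat{y}_i = \boldsymbol{\theta}^H\mathbf{x}: \quad \text{complex linear estimator}, \tag{4.21}$$

or

$$\hat{y} = (\boldsymbol{\theta}_r^T - j\boldsymbol{\theta}_i^T)(\mathbf{x}_r + j\mathbf{x}_i) = (\boldsymbol{\theta}_r^T\mathbf{x}_r + \boldsymbol{\theta}_i^T\mathbf{x}_i) + j(\boldsymbol{\theta}_r^T\mathbf{x}_i - \boldsymbol{\theta}_i^T\mathbf{x}_r). \tag{4.22}$$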


Eq. (4.22) reveals the true flavor behind the complex notation; that is, its multichannel nature. In multichannel estimation, we are given more than one set of input variables, namely, $\mathbf{x}_r$ and $\mathbf{x}_i$, and we want to generate, jointly, more than one output variable, namely, $\hat{y}_r$ and $\hat{y}_i$. Eq. (4.22) can equivalently be written as

$$\begin{bmatrix}\hat{y}_r \\ \hat{y}_i\end{bmatrix} = \Theta\begin{bmatrix}\mathbf{x}_r \\ \mathbf{x}_i\end{bmatrix}, \tag{4.23}$$

where

$$\Theta := \begin{bmatrix}\boldsymbol{\theta}_r^T & \boldsymbol{\theta}_i^T \\ -\boldsymbol{\theta}_i^T & \boldsymbol{\theta}_r^T\end{bmatrix}. \tag{4.24}$$

Multichannel estimation can be generalized to more than two outputs and to more than two input sets of variables. We will come back to the more general multichannel estimation task toward the end of this chapter.

Looking at (4.23), we observe that starting from the direct generalization of the linear estimation task for real-valued signals, which led to the adoption of $\hat{y} = \boldsymbol{\theta}^H\mathbf{x}$, resulted in a matrix, $\Theta$, of a very special structure.
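The equivalence between the complex estimator (4.21) and the structured two-channel form (4.23)-(4.24) can be checked on a single random draw. This is a small sketch with arbitrary values for `theta` and `x`; nothing here is specific to the text beyond the two formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
l = 4
theta = rng.standard_normal(l) + 1j * rng.standard_normal(l)
x = rng.standard_normal(l) + 1j * rng.standard_normal(l)

# Complex linear estimator (4.21); np.vdot conjugates its first argument,
# so this is theta^H x.
y_hat = np.vdot(theta, x)

# Structured real matrix of (4.24), of size 2 x 2l.
Theta = np.block([[theta.real[None, :], theta.imag[None, :]],
                  [-theta.imag[None, :], theta.real[None, :]]])
y_two_channel = Theta @ np.concatenate([x.real, x.imag])   # (4.23)

print(y_hat, y_two_channel)   # real and imaginary parts coincide
```

The matrix has only $2l$ free real parameters out of the $4l$ that a general $2 \times 2l$ matrix would allow, which is the "special structure" the text refers to.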


4.4.1 WIDELY LINEAR COMPLEX-VALUED ESTIMATION 

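Let us define the linear two-channel estimation task starting from the definition of a linear operation in vector spaces. The task is to generate a vector output, $\hat{\mathbf{y}} = [\hat{y}_r, \hat{y}_i]^T \in \mathbb{R}^2$, from the input vector variables, $[\mathbf{x}_r^T, \mathbf{x}_i^T]^T \in \mathbb{R}^{2l}$, via the linear operation

$$\hat{\mathbf{y}} = \begin{bmatrix}\hat{y}_r \\ \hat{y}_i\end{bmatrix} = \Theta\begin{bmatrix}\mathbf{x}_r \\ \mathbf{x}_i\end{bmatrix}, \tag{4.25}$$

where

$$\Theta := \begin{bmatrix}\boldsymbol{\theta}_{11}^T & \boldsymbol{\theta}_{12}^T \\ \boldsymbol{\theta}_{21}^T & \boldsymbol{\theta}_{22}^T\end{bmatrix}, \tag{4.26}$$

and compute the matrix $\Theta$ so as to minimize the total error variance

$$\Theta_* := \arg\min_{\Theta}\left\{E[(y_r - \hat{y}_r)^2] + E[(y_i - \hat{y}_i)^2]\right\}. \tag{4.27}$$

Note that (4.27) can equivalently be written as

$$\Theta_* := \arg\min_{\Theta} E[\mathbf{e}^T\mathbf{e}] = \arg\min_{\Theta}\operatorname{trace}\{E[\mathbf{e}\mathbf{e}^T]\},$$

where

$$\mathbf{e} := \mathbf{y} - \hat{\mathbf{y}}.$$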

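Minimizing (4.27) is equivalent to minimizing the two terms individually; in other words, treating each channel separately (Problem 4.7). Thus, the task can be tackled by solving two sets of normal equations, namely,

$$\Sigma_\varepsilon\begin{bmatrix}\boldsymbol{\theta}_{11} \\ \boldsymbol{\theta}_{12}\end{bmatrix} = \mathbf{p}_r, \quad \Sigma_\varepsilon\begin{bmatrix}\boldsymbol{\theta}_{21} \\ \boldsymbol{\theta}_{22}\end{bmatrix} = \mathbf{p}_i, \tag{4.28}$$

where

$$\Sigma_\varepsilon := E\left[\begin{bmatrix}\mathbf{x}_r \\ \mathbf{x}_i\end{bmatrix}\left[\mathbf{x}_r^T, \mathbf{x}_i^T\right]\right] = \begin{bmatrix}E[\mathbf{x}_r\mathbf{x}_r^T] & E[\mathbf{x}_r\mathbf{x}_i^T] \\ E[\mathbf{x}_i\mathbf{x}_r^T] & E[\mathbf{x}_i\mathbf{x}_i^T]\end{bmatrix} := \begin{bmatrix}\Sigma_r & \Sigma_{ri} \\ \Sigma_{ir} & \Sigma_i\end{bmatrix} \tag{4.29}$$

and

$$\mathbf{p}_r := E\begin{bmatrix}\mathbf{x}_r y_r \\ \mathbf{x}_i y_r\end{bmatrix}, \quad \mathbf{p}_i := E\begin{bmatrix}\mathbf{x}_r y_i \\ \mathbf{x}_i y_i\end{bmatrix}. \tag{4.30}$$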


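The obvious question that is now raised is whether we can tackle this more general two-channel linear estimation task by employing complex-valued arithmetic. The answer is in the affirmative. Let us define

$$\boldsymbol{\theta} := \boldsymbol{\theta}_r + j\boldsymbol{\theta}_i, \quad \mathbf{v} := \mathbf{v}_r + j\mathbf{v}_i, \tag{4.31}$$

and

$$\mathbf{x} = \mathbf{x}_r + j\mathbf{x}_i.$$

Then we define

$$\boldsymbol{\theta}_r := \frac{1}{2}(\boldsymbol{\theta}_{11} + \boldsymbol{\theta}_{22}), \quad \boldsymbol{\theta}_i := \frac{1}{2}(\boldsymbol{\theta}_{12} - \boldsymbol{\theta}_{21}) \tag{4.32}$$

and

$$\mathbf{v}_r := \frac{1}{2}(\boldsymbol{\theta}_{11} - \boldsymbol{\theta}_{22}), \quad \mathbf{v}_i := -\frac{1}{2}(\boldsymbol{\theta}_{12} + \boldsymbol{\theta}_{21}). \tag{4.33}$$

Under the previous definitions, it is a matter of simple algebra (Problem 4.8) to prove that the set of equations in (4.25) is equivalent to

$$\hat{y} := \hat{y}_r + j\hat{y}_i = \boldsymbol{\theta}^H\mathbf{x} + \mathbf{v}^H\mathbf{x}^*: \quad \text{widely linear complex estimator}. \tag{4.34}$$

To distinguish it from (4.21), this is known as the widely linear complex-valued estimator. Note that in (4.34), $\mathbf{x}$ and its complex conjugate $\mathbf{x}^*$ are simultaneously used in order to cover all possible solutions, as those are dictated by the vector space description, which led to the formulation in (4.25).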
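The algebra behind (4.32)-(4.34) can be verified numerically for an arbitrary, unstructured two-channel matrix. The sketch below uses random hypothetical values for the four blocks of $\Theta$ and for the input.

```python
import numpy as np

rng = np.random.default_rng(3)
l = 3
th11, th12, th21, th22 = rng.standard_normal((4, l))   # blocks of Theta in (4.26)
x_r, x_i = rng.standard_normal((2, l))

# General real two-channel estimator (4.25).
y_r = th11 @ x_r + th12 @ x_i
y_i = th21 @ x_r + th22 @ x_i

# Complex parameters defined through (4.32) and (4.33).
theta = 0.5 * (th11 + th22) + 0.5j * (th12 - th21)
v = 0.5 * (th11 - th22) - 0.5j * (th12 + th21)

# Widely linear complex estimator (4.34): theta^H x + v^H x*.
x = x_r + 1j * x_i
y_wl = np.vdot(theta, x) + np.vdot(v, np.conj(x))

print(y_wl, y_r + 1j * y_i)   # the two estimates coincide
```

Setting `th11 = th22` and `th12 = -th21` makes `v` vanish, which recovers the restricted form (4.24).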


Circularity Conditions 

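We now turn our attention to investigating conditions under which the widely linear formulation in (4.34) breaks down to (4.21); that is, the conditions for which the optimal widely linear estimator turns out to have $\mathbf{v} = \mathbf{0}$.

Let

$$\boldsymbol{\varphi} := \begin{bmatrix}\boldsymbol{\theta} \\ \mathbf{v}\end{bmatrix} \quad \text{and} \quad \tilde{\mathbf{x}} := \begin{bmatrix}\mathbf{x} \\ \mathbf{x}^*\end{bmatrix}. \tag{4.35}$$

Then the widely linear estimator is written as

$$\hat{y} = \boldsymbol{\varphi}^H\tilde{\mathbf{x}}.$$

Adopting the orthogonality condition in its complex formulation

$$E[\tilde{\mathbf{x}}e^*] = E[\tilde{\mathbf{x}}(y - \hat{y})^*] = \mathbf{0},$$

we obtain the following set of normal equations for the optimal $\boldsymbol{\varphi}_*$:

$$E[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^H]\boldsymbol{\varphi}_* = E[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^H]\begin{bmatrix}\boldsymbol{\theta}_* \\ \mathbf{v}_*\end{bmatrix} = \begin{bmatrix}E[\mathbf{x}y^*] \\ E[\mathbf{x}^*y^*]\end{bmatrix},$$

or

$$\begin{bmatrix}\Sigma_x & P_x \\ P_x^* & \Sigma_x^*\end{bmatrix}\begin{bmatrix}\boldsymbol{\theta}_* \\ \mathbf{v}_*\end{bmatrix} = \begin{bmatrix}\mathbf{p} \\ \mathbf{q}^*\end{bmatrix}, \tag{4.36}$$

where $\Sigma_x$ and $\mathbf{p}$ have been defined in (4.17) and (4.18), respectively, and

$$P_x := E[\mathbf{x}\mathbf{x}^T], \quad \mathbf{q} := E[\mathbf{x}y]. \tag{4.37}$$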


The matrix $P_x$ is known as the pseudocovariance/autocorrelation matrix of $\mathbf{x}$. Note that (4.36) is the equivalent of (4.28); to obtain the widely linear estimator, one needs to solve one set of complex-valued equations whose number is double compared to that of the linear (complex) formulation.

Assume now that

$$P_x = O \quad \text{and} \quad \mathbf{q} = \mathbf{0}: \quad \text{circularity conditions}. \tag{4.38}$$

We say that in this case, the input-output variables are jointly circular and the input variables in $\mathbf{x}$ obey the (second-order) circular condition. It is readily observed that under the previous circularity assumptions, (4.36) leads to $\mathbf{v}_* = \mathbf{0}$ and the optimal $\boldsymbol{\theta}_*$ is given by the set of normal equations (4.16)-(4.18), which govern the more restricted linear case. Thus, adopting the linear formulation leads to optimality only under certain conditions, which do not always hold true in practice; typical examples of variables that do not respect circularity are met in fMRI imaging (see [1] and the references therein). It can be shown that the MSE achieved by a widely linear estimator is always less than or equal to that obtained via a linear one (Problem 4.9).
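The gap between the linear and the widely linear estimator can be seen on simulated noncircular data. This is a sketch under assumptions: the input is made noncircular by giving its real and imaginary parts different variances, and `theta_true`, `v_true`, and the noise level are hypothetical. Expectations are replaced by sample averages.

```python
import numpy as np

rng = np.random.default_rng(4)
N, l = 20_000, 2

# Noncircular input: real and imaginary parts have different variances.
X = 2.0 * rng.standard_normal((N, l)) + 0.5j * rng.standard_normal((N, l))
theta_true = np.array([1.0 + 0.5j, -0.3j])
v_true = np.array([0.4 - 0.2j, 0.1 + 0.0j])
noise = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
# y = theta^H x + v^H x* + noise, evaluated for every row of X.
y = X @ theta_true.conj() + X.conj() @ v_true.conj() + noise


def fit_mse(Z, y):
    """Solve the sample normal equations E[z z^H] phi = E[z y*]; return phi and the MSE."""
    Sigma = Z.T @ Z.conj() / len(y)       # sample E[z z^H]
    p = Z.T @ y.conj() / len(y)           # sample E[z y*]
    phi = np.linalg.solve(Sigma, p)
    e = y - Z @ phi.conj()                # y - phi^H z for every sample
    return phi, np.mean(np.abs(e) ** 2)


P_hat = X.T @ X / N                                   # sample pseudocovariance (4.37)
phi_lin, mse_lin = fit_mse(X, y)                      # linear: uses x only
phi_wl, mse_wl = fit_mse(np.hstack([X, X.conj()]), y) # widely linear: uses x and x*

print(np.abs(P_hat).max())   # clearly nonzero, so (4.38) is violated
print(mse_lin, mse_wl)       # the widely linear MSE is the smaller one
```

With circular data (equal variances and uncorrelated real and imaginary parts, and `v_true` set to zero) the two errors would be essentially the same, in line with (4.38).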

The notions of circularity and of the widely linear estimation were treated in a series of fundamental papers [35,36]. A stronger condition for circularity is based on the PDF of a complex random variable: A random variable $x$ is circular (or strictly circular) if $x$ and $xe^{j\phi}$ are distributed according to the same PDF; that is, the PDF is rotationally invariant [35]. Fig. 4.4A shows the scatter plot of points generated by a circularly distributed variable, and Fig. 4.4B corresponds to a noncircular one. Strict circularity implies the second-order circularity, but the converse is not always true. For more on complex random variables, the interested reader may consult [3,37]. In [28], it is pointed out that the full second-order statistics of the error, without doubling the dimension, can be exploited if instead of the MSE one employs the Gaussian entropy criterion.

FIGURE 4.4

Scatter plots of points corresponding to (A) a circular process and (B) a noncircular one, in the two-dimensional space.

Finally, note that substituting in (4.29) the second-order circularity conditions, given in (4.38), one obtains (Problem 4.10)

$$\Sigma_r = \Sigma_i, \quad \Sigma_{ri} = -\Sigma_{ir}, \quad E[\mathbf{x}_r y_r] = E[\mathbf{x}_i y_i], \quad E[\mathbf{x}_i y_r] = -E[\mathbf{x}_r y_i], \tag{4.39}$$

which then implies that $\boldsymbol{\theta}_{11} = \boldsymbol{\theta}_{22}$ and $\boldsymbol{\theta}_{12} = -\boldsymbol{\theta}_{21}$; in this case, (4.33) verifies that $\mathbf{v} = \mathbf{0}$ and that the optimal in the MSE sense solution has the special structure of (4.23) and (4.24).

4.4.2 OPTIMIZING WITH RESPECT TO COMPLEX-VALUED VARIABLES: WIRTINGER CALCULUS

So far, in order to derive the estimates of the parameters, for both the linear as well as the widely linear estimators, the orthogonality condition was mobilized. For the complex linear estimation case, the normal equations were derived in Problem 4.6, by direct minimization of the cost function in (4.20). Those who got involved with solving the problem have experienced a procedure that was more cumbersome compared to the real-valued linear estimation. This is because one has to use the real and imaginary parts of all the involved complex variables and express the cost function in terms of the equivalent real-valued quantities only; then the required gradients for the optimization have to be computed. Recall that any (nonconstant) complex function $f: \mathbb{C} \to \mathbb{R}$ is not differentiable with respect to its complex argument, because the Cauchy-Riemann conditions are violated (Problem 4.11). The previously stated procedure of splitting the involved variables into their real and imaginary parts can become cumbersome with respect to algebraic manipulations. Wirtinger calculus provides an equivalent formulation that is based on simple rules and principles, which bear a great resemblance to the rules of standard complex differentiation.

Let $f: \mathbb{C} \to \mathbb{C}$ be a complex function defined on $\mathbb{C}$. Obviously, such a function can be regarded as either defined on $\mathbb{R}^2$ or $\mathbb{C}$ (i.e., $f(z) = f(x + jy) = f(x, y)$). Furthermore, it may be regarded as either complex-valued, $f(x, y) = f_r(x, y) + jf_i(x, y)$, or as vector-valued, $f(x, y) = (f_r(x, y), f_i(x, y))$. We say that $f$ is differentiable in the real sense if both $f_r$ and $f_i$ are differentiable. Wirtinger’s calculus considers the complex structure of $f$ and the real derivatives are described using an equivalent formulation that greatly simplifies calculations; moreover, this formulation bears a surprising similarity to the complex derivatives.

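Definition 4.1. The Wirtinger derivative or W-derivative of a complex function $f$ at a point $z_0 \in \mathbb{C}$ is defined as

$$\frac{\partial f}{\partial z}(z_0) = \frac{1}{2}\left(\frac{\partial f_r}{\partial x}(z_0) + \frac{\partial f_i}{\partial y}(z_0)\right) + \frac{j}{2}\left(\frac{\partial f_i}{\partial x}(z_0) - \frac{\partial f_r}{\partial y}(z_0)\right): \quad \text{W-derivative}.$$

The conjugate Wirtinger’s derivative or CW-derivative of $f$ at $z_0$ is defined as

$$\frac{\partial f}{\partial z^*}(z_0) = \frac{1}{2}\left(\frac{\partial f_r}{\partial x}(z_0) - \frac{\partial f_i}{\partial y}(z_0)\right) + \frac{j}{2}\left(\frac{\partial f_i}{\partial x}(z_0) + \frac{\partial f_r}{\partial y}(z_0)\right): \quad \text{CW-derivative}.$$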

For some of the properties and the related proofs regarding Wirtinger’s derivatives, see Appendix A.3. An important property for us is that if $f$ is real-valued (i.e., $f: \mathbb{C} \to \mathbb{R}$) and $z_0$ is a (local) optimal point of $f$, it turns out that

$$\frac{\partial f}{\partial z}(z_0) = \frac{\partial f}{\partial z^*}(z_0) = 0: \quad \text{optimality conditions}. \tag{4.40}$$

In order to apply Wirtinger’s derivatives, the following simple tricks are adopted:

• express function $f$ in terms of $z$ and $z^*$;

• to compute the W-derivative apply the usual differentiation rule, treating $z^*$ as a constant;

• to compute the CW-derivative apply the usual differentiation rule, treating $z$ as a constant.


It should be emphasized that all these statements must be regarded as useful computational tricks rather than rigorous mathematical rules. Analogous definitions and properties carry over to complex vectors $\mathbf{z}$, and the W-gradient and CW-gradient

$$\nabla_{\mathbf{z}} f(\mathbf{z}_0), \quad \nabla_{\mathbf{z}^*} f(\mathbf{z}_0)$$

result from the respective definitions if partial derivatives are replaced by partial gradients, $\nabla_{\mathbf{x}}$, $\nabla_{\mathbf{y}}$.
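The computational rules above can be checked against Definition 4.1 with finite differences. The sketch below is illustrative only: the helper `wirtinger_derivatives`, the step `h`, and the two test functions are choices made for this example.

```python
def wirtinger_derivatives(f, z0, h=1e-6):
    """Central-difference estimates of the W- and CW-derivatives of f at z0."""
    df_dx = (f(z0 + h) - f(z0 - h)) / (2 * h)            # partial w.r.t. the real part
    df_dy = (f(z0 + 1j * h) - f(z0 - 1j * h)) / (2 * h)  # partial w.r.t. the imaginary part
    w = 0.5 * (df_dx - 1j * df_dy)     # W-derivative, equivalent to Definition 4.1
    cw = 0.5 * (df_dx + 1j * df_dy)    # CW-derivative, equivalent to Definition 4.1
    return w, cw


def f(z):
    """Real-valued cost f(z) = z z* = |z|^2."""
    return abs(z) ** 2


def g(z):
    """Holomorphic function g(z) = z^2, which does not depend on z*."""
    return z ** 2


z0 = 1.5 - 0.7j
w_f, cw_f = wirtinger_derivatives(f, z0)
w_g, cw_g = wirtinger_derivatives(g, z0)

print(w_f, cw_f)   # rules predict z0* and z0 (treat z* or z as a constant)
print(w_g, cw_g)   # rules predict 2*z0 and 0
```

For the real-valued `f`, the CW-derivative vanishes only at `z0 = 0`, which is its minimizer, in agreement with (4.40).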

Although Wirtinger’s calculus has been known since 1927 [51], its use in applications has a rather recent history [7]. Its revival was ignited by the widely linear filtering concept [27]. The interested reader may obtain more on this issue from [2,25,30]. Extensions of Wirtinger’s derivative to general Hilbert (infinite-dimensional) spaces were developed more recently in [6] and to the subgradient notion in [46].

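Application in linear estimation. The cost function in this case is

$$J(\boldsymbol{\theta}, \boldsymbol{\theta}^*) = E[|y - \boldsymbol{\theta}^H\mathbf{x}|^2] = E[(y - \boldsymbol{\theta}^H\mathbf{x})(y^* - \boldsymbol{\theta}^T\mathbf{x}^*)].$$

Thus, treating $\boldsymbol{\theta}$ as a constant, the optimum occurs at

$$\nabla_{\boldsymbol{\theta}^*}J = -E[\mathbf{x}e^*] = \mathbf{0},$$

which is the orthogonality condition leading to the normal equations (4.16)-(4.18).

Application in widely linear estimation. The cost function is now (see notation in (4.35))

$$J(\boldsymbol{\varphi}, \boldsymbol{\varphi}^*) = E[(y - \boldsymbol{\varphi}^H\tilde{\mathbf{x}})(y^* - \boldsymbol{\varphi}^T\tilde{\mathbf{x}}^*)],$$

and treating $\boldsymbol{\varphi}$ as a constant,

$$\nabla_{\boldsymbol{\varphi}^*}J = -E[\tilde{\mathbf{x}}e^*] = -E\begin{bmatrix}\mathbf{x}e^* \\ \mathbf{x}^*e^*\end{bmatrix} = \mathbf{0},$$

which leads to the set derived in (4.36).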

Wirtinger’s calculus will prove very useful in subsequent chapters for deriving gradient opera- 
tions in the context of online/adaptive estimation in Euclidean as well as in reproducing kernel Hilbert 
spaces. 


4.5 LINEAR FILTERING 

Linear statistical filtering is an instance of the general estimation task, when the notion of time evolution 
needs to be taken into consideration and estimates are obtained at each time instant. There are three 
major types of problems that emerge: 

• Filtering, where the estimate at time instant $n$ is based on all previously received (measured) input information up to and including the current time index, $n$.

• Smoothing, where data over a time interval $[0, N]$ are first collected and an estimate is obtained at each time instant $n \le N$, using all the available information in the interval $[0, N]$.

• Prediction, where estimates at times $n + \tau$, $\tau > 0$, are to be obtained based on the information up to and including time instant $n$.

To fit the above definitions more closely to what has been said so far in the chapter, take for example a time-varying case, where the output variable at time instant $n$ is $y_n$ and its value depends on observations included in the corresponding input vector $\mathbf{x}_n$. In filtering, the latter can include measurements received only at time instants $n, n-1, \ldots, 0$. This restriction in the index set is directly related to causality. In contrast, in smoothing, future time instants are included in addition to the past, i.e., $\ldots, n+2, n+1, n, n-1, \ldots$.

Most of the effort in this book will be spent on filtering whenever time information enters into the 
picture. The reason is that this is the most commonly encountered task and, also, the techniques used 
for smoothing and prediction are similar in nature to that of filtering, with usually minor modifications. 

In signal processing, the term filtering is usually used in a more specific context, and it refers to the operation of a filter, which acts on an input random process/signal ($u_n$), to transform it into another one ($d_n$); see Section 2.4.3. Note that we have switched to the notation, introduced in Chapter 2, used to denote random processes. We prefer to keep different notation for processes and random variables, because in the case of random processes, the filtering task obtains a special structure and properties, as we will soon see. Moreover, although the mathematical formulation of the involved equations, for both cases, may end up being the same, we feel that it is good for the reader to keep in mind that there is a different underlying mechanism for generating the data.

FIGURE 4.5

In statistical filtering, the impulse response coefficients are estimated so as to minimize the error between the output and the desired response processes. In MSE linear filtering, the cost function is $E[e_n^2]$.

The task in statistical linear filtering is to compute the coefficients (impulse response) of the filter so that the output process of the filter, $\hat{d}_n$, when the filter is excited by the input random process, $u_n$, is as close as possible to a desired response process, $d_n$. In other words, the goal is to minimize, in some sense, the corresponding error process (see Fig. 4.5). Assuming that the unknown filter is of a finite impulse response (FIR) (see Section 2.4.3 for related definitions), denoted as $w_0, w_1, \ldots, w_{l-1}$, the output $\hat{d}_n$ of the filter is given as

$$\hat{d}_n = \sum_{i=0}^{l-1} w_i u_{n-i} = \mathbf{w}^T\mathbf{u}_n: \quad \text{convolution sum}, \tag{4.41}$$

where

$$\mathbf{w} = [w_0, w_1, \ldots, w_{l-1}]^T \quad \text{and} \quad \mathbf{u}_n = [u_n, u_{n-1}, \ldots, u_{n-l+1}]^T. \tag{4.42}$$
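The two forms of (4.41), the convolution sum and the inner product with the vector of (4.42), can be compared on a short realization. The filter taps `w` and the time index `n` below are hypothetical values chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
w = np.array([0.5, -0.3, 0.2])     # hypothetical FIR impulse response, l = 3
u = rng.standard_normal(50)        # one realization of the input process
l = len(w)

# Convolution sum: entry n of the full convolution is sum_i w_i u_{n-i}.
d_conv = np.convolve(u, w)[:len(u)]

# Inner-product form at a time instant n >= l - 1.
n = 10
u_n = u[n - l + 1:n + 1][::-1]     # [u_n, u_{n-1}, ..., u_{n-l+1}] as in (4.42)
d_n = w @ u_n

print(d_n, d_conv[n])              # identical up to rounding
```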

Fig. 4.6 illustrates the convolution operation of the linear filter, when its input is excited by a realization $u_n$ of the input process to provide in the output the signal/sequence $\hat{d}_n$.

Alternatively, (4.41) can be viewed as a linear estimator function; given the jointly distributed variables, at time instant $n$, $(d_n, \mathbf{u}_n)$, (4.41) provides the estimator, $\hat{d}_n$, given the values of $\mathbf{u}_n$. In order to obtain the coefficients, $\mathbf{w}$, the MSE criterion will be adopted. Furthermore, we will assume that:






• The processes $u_n$, $d_n$ are wide-sense stationary real random processes.

• Their mean values are equal to zero, in other words, $E[u_n] = E[d_n] = 0$, $\forall n$. If this is not the case, we can subtract the respective mean values from the processes, $u_n$ and $d_n$, during a preprocessing stage. Due to this assumption, the autocorrelation and covariance matrices of $\mathbf{u}_n$ coincide, so that

$$R_u = \Sigma_u.$$

FIGURE 4.6

The linear filter is excited by a realization of an input process. The output signal is the convolution between the input sequence and the filter’s impulse response.


The normal equations in (4.5) now take the form

$$\Sigma_u\mathbf{w} = \mathbf{p},$$

where

$$\mathbf{p} = \left[E[u_n d_n], \ldots, E[u_{n-l+1} d_n]\right]^T,$$

and the respective covariance/autocorrelation matrix, of order $l$, of the input process is given by

$$\Sigma_u := E[\mathbf{u}_n\mathbf{u}_n^T] = \begin{bmatrix} r(0) & r(1) & \cdots & r(l-1) \\ r(1) & r(0) & \cdots & r(l-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(l-1) & r(l-2) & \cdots & r(0)\end{bmatrix}, \tag{4.43}$$

where $r(k)$ is the autocorrelation sequence of the input process. Because we have assumed that the involved processes are wide-sense stationary, we have

$$r(n, n-k) := E[u_n u_{n-k}] = r(k).$$

Also, recall that, for real wide-sense stationary processes, the autocorrelation sequence is symmetric, or $r(k) = r(-k)$ (Section 2.4.3). Observe that in this case, where the input vector results from a random process, the covariance matrix has a special structure, which will be exploited later on to derive efficient schemes for the solution of the normal equations.
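As an illustration of (4.43), the following sketch estimates the autocorrelation and cross-correlation sequences from data, builds the Toeplitz matrix, and solves the normal equations. It is set up as a system identification toy problem: `w_true`, the moving-average input, and the noise level are assumptions made for the example, and a general-purpose solver is used instead of the efficient schemes mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(6)
N, l = 100_000, 3
w_true = np.array([1.0, 0.5, -0.25])          # hypothetical unknown FIR system

# Correlated input: white noise passed through a short moving average.
u = np.convolve(rng.standard_normal(N), [1.0, 0.6])[:N]
# Desired response: output of the unknown system plus observation noise.
d = np.convolve(u, w_true)[:N] + 0.1 * rng.standard_normal(N)


def corr(a, b, k):
    """Sample estimate of E[a_n b_{n-k}] for a lag k >= 0."""
    return np.mean(a[k:] * b[:len(b) - k])


r = np.array([corr(u, u, k) for k in range(l)])   # r(0), ..., r(l-1)
p = np.array([corr(d, u, k) for k in range(l)])   # E[d_n u_{n-k}], k = 0, ..., l-1

idx = np.arange(l)
Sigma_u = r[np.abs(idx[:, None] - idx[None, :])]  # entry (i, j) is r(|i - j|)
w = np.linalg.solve(Sigma_u, p)                   # normal equations

print(Sigma_u)   # symmetric Toeplitz: constant along every diagonal
print(w)         # close to w_true for a long enough record
```

Only the $l$ values $r(0), \ldots, r(l-1)$ are needed to fill the $l \times l$ matrix, which is the structure that fast solvers exploit.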

For the complex linear filtering case, the only differences are:

• the output is given as $\hat{d}_n = \mathbf{w}^H\mathbf{u}_n$,

• $\mathbf{p} = E[\mathbf{u}_n d_n^*]$,

• $\Sigma_u = E[\mathbf{u}_n\mathbf{u}_n^H]$,

• $r(-k) = r^*(k)$.


4.6 MSE LINEAR FILTERING: A FREQUENCY DOMAIN POINT OF VIEW 

Let us now turn our attention to the more general case, and assume that our filter is of infinite impulse response (IIR). Then (4.41) becomes

$$\hat{d}_n = \sum_{i=-\infty}^{+\infty} w_i u_{n-i}. \tag{4.44}$$

Moreover, we have allowed the filter to be noncausal.⁴ Following similar arguments as those used to prove the MSE optimality of $E[y|x]$ in Section 3.9.1, it turns out that the optimal filter coefficients must satisfy the following condition (Problem 4.12):

$$E\left[\left(d_n - \sum_{i=-\infty}^{+\infty} w_i u_{n-i}\right)u_{n-j}\right] = 0, \quad j \in \mathbb{Z}. \tag{4.45}$$

Observe that this is a generalization (involving an infinite number of terms) of the orthogonality condition stated in (4.10). A rearrangement of the terms in (4.45) results in

$$\sum_{i=-\infty}^{+\infty} w_i E[u_{n-i}u_{n-j}] = E[d_n u_{n-j}], \quad j \in \mathbb{Z}, \tag{4.46}$$

and finally in

$$\sum_{i=-\infty}^{+\infty} w_i r(j - i) = r_{du}(j), \quad j \in \mathbb{Z}, \tag{4.47}$$

where $r_{du}(j)$ denotes the cross-correlation sequence between the processes $d_n$ and $u_n$. Eq. (4.47) can be considered as the generalization of (4.5) to the case of random processes. The problem now is how one can solve (4.47), which involves infinitely many parameters. The way out is to cross into the frequency domain. Eq. (4.47) can be seen as the convolution of the unknown sequence with the autocorrelation sequence of the input process, which gives rise to the cross-correlation sequence. However, we know that convolution of two sequences corresponds to the product of the respective Fourier transforms (e.g., [42] and Section 2.4.2). Thus, we can now write that

$$W(\omega)S_u(\omega) = S_{du}(\omega), \tag{4.48}$$

where $W(\omega)$ is the Fourier transform of the sequence of the unknown parameters and $S_u(\omega)$ is the power spectral density of the input process, defined in Section 2.4.3. In analogy, the Fourier transform $S_{du}(\omega)$ of the cross-correlation sequence is known as the cross-spectral density. If the latter two quantities are available, then once $W(\omega)$ has been computed, the unknown parameters can be obtained via the inverse Fourier transform.
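The passage from (4.47) to (4.48) can be illustrated with short sequences and a discrete Fourier transform. This is a sketch under assumptions: the two-sided filter `w_taps` and the autocorrelation `r_taps` are hypothetical, both have finite support, and the DFT length `M` is taken long enough that circular and linear convolution agree.

```python
import numpy as np

M = 64   # DFT length; must exceed the support of all sequences involved

# Hypothetical two-sided (noncausal) filter w_i and an autocorrelation r(k).
w_taps = {-2: 0.1, -1: -0.4, 0: 1.0, 1: 0.3, 2: -0.2}
r_taps = {-1: 0.6, 0: 1.36, 1: 0.6}

# Cross-correlation from (4.47): r_du(j) = sum_i w_i r(j - i).
r_du_taps = {}
for i, w_i in w_taps.items():
    for k, r_k in r_taps.items():
        r_du_taps[i + k] = r_du_taps.get(i + k, 0.0) + w_i * r_k


def on_grid(taps):
    """Store a two-sided sequence on a length-M circular grid (lag k -> index k mod M)."""
    s = np.zeros(M)
    for k, value in taps.items():
        s[k % M] = value
    return s


S_u = np.fft.fft(on_grid(r_taps))        # samples of the power spectral density
S_du = np.fft.fft(on_grid(r_du_taps))    # samples of the cross-spectral density
W = S_du / S_u                           # (4.48) solved for W
w_rec = np.fft.ifft(W).real              # back to the filter coefficients

print(np.round([w_rec[k % M] for k in range(-2, 3)], 6))   # recovers w_taps
```

The division is safe here because the chosen power spectral density is strictly positive at all frequencies; in general, frequencies where $S_u(\omega)$ vanishes need separate care.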


DECONVOLUTION: IMAGE DEBLURRING 

We will now consider an important application in order to demonstrate the power of MSE linear estimation. Image deblurring is a typical deconvolution task. An image is degraded due to its transmission via a nonideal system; the task of deconvolution is to optimally recover (in the MSE sense in our case) the original undegraded image. Fig. 4.7A shows the original image and Fig. 4.7B a blurred version (e.g., taken by a nonsteady camera) with some small additive noise.

4 A system is called causal if the output $\hat{d}_n$ depends only on input values $u_m$, $m \le n$. A necessary and sufficient condition for causality is that the impulse response is zero for negative time instants, meaning that $w_n = 0$, $n < 0$. This can easily be checked out; try it.

FIGURE 4.7

(A) The original image and (B) its blurred and noisy version.

At this point, it is interesting to recall that deconvolution is a process that our human brain performs all the time. The human visual system (and not only the human one) is one of the most complex and highly developed biological systems, formed over millions of years of a continuous evolution process. Any raw image that falls on the retina of the eye is severely blurred. Thus, one of the main early processing activities of our visual system is to deblur it (see, e.g., [29] and the references therein for a related discussion).

Before we proceed any further, the following assumptions are adopted:

• The image is a wide-sense stationary two-dimensional random process. Two-dimensional random processes are also known as random fields (see Chapter 15).

• The image is of an infinite extent; this can be justified for the case of large images. This assumption will grant us the “permission” to use (4.48). The fact that an image is a two-dimensional process does not change anything in the theoretical analysis; the only difference is that now the Fourier transforms involve two frequency variables, $\omega_1$, $\omega_2$, one for each of the two dimensions.

A gray image is represented as a two-dimensional array. To stay close to the notation used so far, let $d(n, m)$, $n, m \in \mathbb{Z}$, be the original undegraded image (which for us is now the desired response), and $u(n, m)$, $n, m \in \mathbb{Z}$, be the degraded one, obtained as

$$u(n, m) = \sum_{i=-\infty}^{+\infty}\sum_{j=-\infty}^{+\infty} h(i, j)\,d(n - i, m - j) + \eta(n, m), \tag{4.49}$$

where $\eta(n, m)$ is the realization of a noise field, which is assumed to be zero mean and independent of the input (undegraded) image. The sequence $h(i, j)$ is the point spread sequence (impulse response) of the system (e.g., camera). We will assume that this is known and that it has, somehow, been measured.

Our task now is to estimate a two-dimensional filter, $w(n, m)$, which is applied to the degraded image to optimally reconstruct (in the MSE sense) the original undegraded image. In the current context, Eq. (4.48) is written as

$$W(\omega_1, \omega_2)S_u(\omega_1, \omega_2) = S_{du}(\omega_1, \omega_2).$$

Following similar arguments as those used to derive Eq. (2.130) of Chapter 2, it is shown that (Problem 4.13)

$$S_{du}(\omega_1, \omega_2) = H^*(\omega_1, \omega_2)S_d(\omega_1, \omega_2) \tag{4.50}$$

and

$$S_u(\omega_1, \omega_2) = |H(\omega_1, \omega_2)|^2 S_d(\omega_1, \omega_2) + S_\eta(\omega_1, \omega_2), \tag{4.51}$$

where “*” denotes complex conjugation and $S_\eta$ is the power spectral density of the noise field. Thus, we finally obtain

$$W(\omega_1, \omega_2) = \frac{1}{H(\omega_1, \omega_2)}\,\frac{|H(\omega_1, \omega_2)|^2}{|H(\omega_1, \omega_2)|^2 + \dfrac{S_\eta(\omega_1, \omega_2)}{S_d(\omega_1, \omega_2)}}. \tag{4.52}$$


Once $W(\omega_1, \omega_2)$ has been computed, the unknown parameters could be obtained via an inverse (two-dimensional) Fourier transform. The deblurred image then results as

$$\hat{d}(n, m) = \sum_{i=-\infty}^{+\infty}\sum_{j=-\infty}^{+\infty} w(i, j)\,u(n - i, m - j). \tag{4.53}$$

In practice, because we are not really interested in obtaining the weights of the deconvolution filter, we implement (4.53) in the frequency domain

$$\hat{D}(\omega_1, \omega_2) = W(\omega_1, \omega_2)U(\omega_1, \omega_2),$$

and then obtain the inverse Fourier transform. Thus all processing is efficiently performed in the frequency domain. Software packages to perform Fourier transforms (via the fast Fourier transform [FFT]) of an image array are “omnipresent” on the internet.

FIGURE 4.8

(A) The original image and (B) the deblurred one for $C = 2.3 \times 10^{-6}$. Observe that in spite of the simplicity of the method, the reconstruction is pretty good. The differences become more obvious to the eye when the images are enlarged.

5 Note that this is not always the case.

Another important issue is that in practice we do not know $S_d(\omega_1,\omega_2)$. An approximation which is
usually adopted and renders sensible results can be made by assuming that
$S_\eta(\omega_1,\omega_2)/S_d(\omega_1,\omega_2)$ is a constant, $C$,
and trying different values of it. Fig. 4.8 shows the deblurred image for $C = 2.3 \times 10^{-6}$. The quality
of the end result depends a lot on the choice of this value (MATLAB® exercise 4.25). Other, more
advanced, techniques have also been proposed. For example, one can get a better estimate of
$S_d(\omega_1,\omega_2)$ by using information from $S_\eta(\omega_1,\omega_2)$ and $S_u(\omega_1,\omega_2)$. The interested
reader can obtain more on the image deconvolution/restoration task from, e.g., [14,34].
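
The frequency-domain recipe of (4.52) and (4.53), with the ratio $S_\eta/S_d$ replaced by a constant $C$,
can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the name
`wiener_deblur` is hypothetical, the blur is treated as a circular convolution (which is what the FFT
product implies), and the filter is written as $H^*/(|H|^2 + C)$, which is algebraically the same as
(4.52) but avoids dividing by $H$ where it is zero.

```python
import numpy as np

def wiener_deblur(u, h, C):
    """Deblur image u given point spread sequence h, assuming S_eta / S_d = C (a constant)."""
    H = np.fft.fft2(h, s=u.shape)             # H(w1, w2), h zero-padded to the image size
    W = np.conj(H) / (np.abs(H) ** 2 + C)     # same as (1/H) |H|^2 / (|H|^2 + C)
    D_hat = W * np.fft.fft2(u)                # frequency-domain version of (4.53)
    return np.real(np.fft.ifft2(D_hat))

# Hypothetical toy data: a random "image", a small blur kernel, and a little noise.
rng = np.random.default_rng(0)
d = rng.random((32, 32))
h = np.array([[0.6, 0.2], [0.1, 0.1]])
blurred = np.real(np.fft.ifft2(np.fft.fft2(h, s=d.shape) * np.fft.fft2(d)))
u = blurred + 1e-3 * rng.standard_normal(d.shape)
d_hat = wiener_deblur(u, h, C=1e-4)
```

In line with the remark above, the value of `C` is a tuning knob: one would typically try several values
and inspect the result.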


4.7 SOME TYPICAL APPLICATIONS 

Optimal linear estimation/filtering has been applied in a wide range of diverse applications of statistical
learning, such as regression modeling, communications, control, biomedical signal processing, seismic
signal processing, and image processing. In the sequel, we present some typical applications in order
for the reader to grasp the main rationale of how the previously stated theory can find its way in solving
practical problems. In all cases, wide-sense stationarity of the involved random processes is assumed.

4.7.1 INTERFERENCE CANCELATION 

In interference cancelation, we have access to a mixture of two signals expressed as $d_n = y_n + s_n$.
Ideally, we would like to remove one of them, say, $y_n$. We will consider them as realizations of respective
random processes/signals, or $\mathrm{d}_n$, $\mathrm{y}_n$, and $\mathrm{s}_n$. To achieve this goal, the only
available information is another signal, say, $\mathrm{u}_n$, that is statistically related to the unwanted
signal, $\mathrm{y}_n$. For example, $\mathrm{y}_n$ may be a filtered version of $\mathrm{u}_n$. This is illustrated
in Fig. 4.9, where the corresponding realizations of the involved random processes are shown.

FIGURE 4.9

A basic block diagram illustrating the interference cancelation task.

Process $\mathrm{y}_n$ is the output of an unknown system $H$, whose input is excited by $\mathrm{u}_n$. The task
is to model $H$ by obtaining estimates of its impulse response (assuming that it is linear time-invariant
and of known order). Then the output of the model will be an approximation of $\mathrm{y}_n$ when this is
activated by the same input, $\mathrm{u}_n$. We will use $\mathrm{d}_n$ as the desired response process. The
optimal estimates of $w_0, \ldots, w_{l-1}$ (assuming the order of the unknown system $H$ to be $l$) are
provided by the normal equations

\[ \Sigma_u \mathbf{w}_* = \mathbf{p}. \]

However,

\[ \mathbf{p} = \mathbb{E}[\mathbf{u}_n \mathrm{d}_n] = \mathbb{E}[\mathbf{u}_n(\mathrm{y}_n + \mathrm{s}_n)]
   = \mathbb{E}[\mathbf{u}_n \mathrm{y}_n], \tag{4.54} \]

because the respective input vector $\mathbf{u}_n$ and $\mathrm{s}_n$ are considered statistically independent.
That is, the previous formulation of the problem leads to the same normal equations as when the desired
response was the signal $\mathrm{y}_n$, which we want to remove! Hence, the output of our model will be an
approximation (in the MSE sense), $\hat{\mathrm{y}}_n$, of $\mathrm{y}_n$, and if subtracted from $\mathrm{d}_n$ the
resulting (error) signal $\mathrm{e}_n$ will be an approximation to $\mathrm{s}_n$. How good this approximation is
depends on whether $l$ is a good “estimate” of the true order of $H$. The cross-correlation in the
right-hand side of (4.54) can be approximated by computing the respective sample mean values, in particular
over periods where $\mathrm{s}_n$ is absent. In practical systems, online/adaptive versions of this
implementation are usually employed, as we will see in Chapter 5.
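
The argument around (4.54) can be checked numerically with a short sketch. Everything specific here is an
assumption made for illustration: the helper name `wiener_fir`, the impulse response `h` of the "unknown"
system, the white Gaussian input, and the sinusoidal signal of interest. The covariance matrix and the
cross-correlation vector are replaced by sample means, as suggested in the text.

```python
import numpy as np

def wiener_fir(u, d, l):
    """Sample-mean version of the order-l normal equations Sigma_u w = p."""
    N = len(u)
    # Row i holds the input vector [u_n, u_{n-1}, ..., u_{n-l+1}] for n = l - 1 + i.
    U = np.column_stack([u[l - 1 - k: N - k] for k in range(l)])
    Sigma_u = U.T @ U / U.shape[0]          # estimate of E[u_n u_n^T]
    p = U.T @ d[l - 1:] / U.shape[0]        # estimate of E[u_n d_n]
    return np.linalg.solve(Sigma_u, p), U

rng = np.random.default_rng(0)
N = 20000
h = np.array([0.8, -0.4, 0.2])              # hypothetical unknown system H
u = rng.standard_normal(N)                  # reference input u_n
y = np.convolve(u, h)[:N]                   # interference y_n = (h * u)_n
s = np.cos(2e-3 * np.pi * np.arange(N))     # signal of interest s_n, independent of u_n
d = y + s                                   # observed mixture d_n

w, U = wiener_fir(u, d, l=len(h))
e = d[len(h) - 1:] - U @ w                  # error signal, an approximation of s_n
```

Although the desired response fed to the estimator is the mixture `d`, the estimated taps `w` come out
close to `h`, and the error `e` tracks `s`, which is the behavior predicted by (4.54).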

Interference cancelation schemes have been widely used in many systems such as noise cancelation, 
echo cancelation in telephone networks, and video conferencing, and in biomedical applications, for 
example, in order to cancel the maternal interference in a fetal electrocardiograph. 

Fig. 4.10 illustrates the echo cancelation task in a video conference application. The same setup
applies to the hands-free telephone service in a car. The far-end speech signal is considered to be a
realization $u_n$ of a random process $\mathrm{u}_n$; through the loudspeakers, it is broadcast in room A (car)
and it is reflected in the interior of the room. Part of it is absorbed and part of it enters the microphone;
this is denoted as $y_n$. The equivalent response of the room (reflections) on $\mathrm{u}_n$ can be represented
by a filter, $H$, as in Fig. 4.9. Signal $y_n$ returns back and the speaker in location B listens to her or his
own voice, together with the near-end speech signal, $s_n$, of the speaker in A. In certain cases, this
feedback path from the loudspeakers to the microphone can cause instabilities, giving rise to a “howling”
sound effect. The goal of the echo canceler is to optimally remove $y_n$.

4.7.2 SYSTEM IDENTIFICATION 

System identification is similar in nature to the interference cancelation task. Note that in Fig. 4.9, one
basically models the unknown system. However, the focus there was on replicating the output $\mathrm{y}_n$ and
not on the system's impulse response.

FIGURE 4.10

The echo canceler is optimally designed to remove the part of the far-end signal, $u_n$, that interferes
with the near-end signal, $s_n$.

FIGURE 4.11

In system identification, the impulse response of the model is optimally estimated so that the output is
close, in the MSE sense, to that of the unknown plant. The red line indicates that the error is used for
the optimal estimation of the unknown parameters of the filter.

In system identification, the aim is to model the impulse response of an unknown plant. To this end,
we have access to its input signal as well as to a noisy version of its output. The task is to design a
model whose impulse response approximates that of the unknown plant. To achieve this, we optimally
design a linear filter whose input is the same signal as the one that activates the plant and its desired
response is the noisy output of the plant (see Fig. 4.11). The associated normal equations are


\[ \Sigma_u \mathbf{w}_* = \mathbb{E}[\mathbf{u}_n \mathrm{d}_n] = \mathbb{E}[\mathbf{u}_n \mathrm{y}_n] + \mathbf{0}, \]


assuming the noise $\eta_n$ is statistically independent of $\mathbf{u}_n$. Thus, once more, the resulting
normal equations are as if we had provided the model with a desired response equal to the noiseless output
of the unknown plant, expressed as $\mathrm{d}_n = \mathrm{y}_n$. Hence, the impulse response of the model is
estimated so that its output is close, in the MSE sense, to the true (noiseless) output of the unknown
plant. System identification is of major importance in a number of applications. In control, it is used for
driving the associated controllers. In data communications, it is used for estimating the transmission
channel in order to build up maximum likelihood estimators of the transmitted data. In many practical
systems, adaptive versions of the system identification scheme are implemented, as we will discuss in
following chapters.

FIGURE 4.12

The task of an equalizer is to optimally recover the originally transmitted information sequence $s_n$,
delayed by $L$ time lags.

4.7.3 DECONVOLUTION: CHANNEL EQUALIZATION 

Note that in the cancelation task the goal was to “remove” the (filtered version of the) input signal
($\mathrm{u}_n$) to the unknown system $H$. In system identification, the focus was on the (unknown) system
itself. In deconvolution, the emphasis is on the input of the unknown system. That is, our goal now is to
recover, in the MSE optimal sense, a (delayed) input signal, $\mathrm{d}_n = \mathrm{s}_{n-L+1}$, where $L$ is the
delay in units of the sampling period $T$. The task is also called inverse system identification. The term
equalization or channel equalization is used in communications. The deconvolution task was introduced in
the context of image deblurring in Section 4.6. There, the required information about the unknown input
process was obtained via an approximation. In the current framework, this can be approached via the
transmission of a training sequence.

The goal of an equalizer is to recover the transmitted information symbols, by mitigating the so-called
intersymbol interference (ISI) that any (imperfect) dispersive communication channel imposes
on the transmitted signal; besides ISI, additive noise is also present in the transmitted information bits
(see Example 4.2). Equalizers are “omnipresent” these days: in our mobile phones, in our modems,
etc. Fig. 4.12 presents the basic scheme for an equalizer. The equalizer is trained so that its output is as
close as possible to the transmitted data bits delayed by some time lag $L$; the delay is used in order to
account for the overall delay imposed by the channel equalizer system. Deconvolution/channel
equalization is at the heart of a number of applications besides communications, such as acoustics, optics,
seismic signal processing, and control. The channel equalization task will also be discussed in the next
chapter in the context of online learning via the decision feedback equalization mode of operation.

Example 4.1 (Noise cancelation). The noise cancelation application is illustrated in Fig. 4.13. The
signal of interest is a realization of a process $\mathrm{s}_n$, which is contaminated by the noise process
$\mathrm{v}_1(n)$. For example, $\mathrm{s}_n$ may be the speech signal of the pilot in the cockpit and
$\mathrm{v}_1(n)$ the aircraft noise at the location of the microphone. We assume that $\mathrm{v}_1(n)$ is an AR
process of order one, expressed as

\[ \mathrm{v}_1(n) = a_1 \mathrm{v}_1(n-1) + \eta_n. \]

FIGURE 4.13

A block diagram for a noise canceler. The signals shown are realizations of the corresponding random
variables that are used in the text. Using as desired response the contaminated signal, the output of the
optimal filter is an estimate of the noise component.


The random signal $\mathrm{v}_2(n)$ is a noise sequence which is related to $\mathrm{v}_1(n)$, but it is
statistically independent of $\mathrm{s}_n$. For example, it may be the noise picked up from another microphone
positioned at a nearby location. This is also assumed to be an AR process of the first order,

\[ \mathrm{v}_2(n) = a_2 \mathrm{v}_2(n-1) + \eta_n. \]

Note that both $\mathrm{v}_1(n)$ and $\mathrm{v}_2(n)$ are generated by the same noise source, $\eta_n$, which is
assumed to be white of variance $\sigma_\eta^2$. For example, in an aircraft it can be assumed that the noise
at different points is due to a “common” source, especially for nearby locations.

The goal of the example is to compute estimates of the weights of the noise canceler, in order to
optimally remove (in the MSE sense) the noise $\mathrm{v}_1(n)$ from the mixture $\mathrm{s}_n + \mathrm{v}_1(n)$.
Assume the canceler to be of order two.

The input to the canceler is $\mathrm{v}_2(n)$ and as desired response the mixture signal,
$\mathrm{d}_n = \mathrm{s}_n + \mathrm{v}_1(n)$, will be used. To establish the normal equations, we need to compute
the covariance matrix, $\Sigma_2$, of $\mathrm{v}_2(n)$ and the cross-correlation vector, $\mathbf{p}_2$, between the
input random vector, $\mathbf{v}_2(n)$, and $\mathrm{d}_n$.

Because $\mathrm{v}_2(n)$ is an AR process of the first order, recall from Section 2.4.4 that the
autocorrelation sequence is given by

\[ r_2(k) = \frac{a_2^k \sigma_\eta^2}{1 - a_2^2}, \quad k = 0, 1, \ldots. \tag{4.55} \]

Hence,

\[ \Sigma_2 = \begin{bmatrix} r_2(0) & r_2(1) \\ r_2(1) & r_2(0) \end{bmatrix}
   = \begin{bmatrix} \dfrac{\sigma_\eta^2}{1-a_2^2} & \dfrac{a_2\sigma_\eta^2}{1-a_2^2} \\[2mm]
                     \dfrac{a_2\sigma_\eta^2}{1-a_2^2} & \dfrac{\sigma_\eta^2}{1-a_2^2} \end{bmatrix}. \]

6 We use the index n in parentheses to unclutter notation due to the presence of a second subscript.

Next, we are going to compute the cross-correlation vector. We have

\[ p_2(0) := \mathbb{E}[\mathrm{v}_2(n)\mathrm{d}_n] = \mathbb{E}[\mathrm{v}_2(n)(\mathrm{s}_n + \mathrm{v}_1(n))]
   = \mathbb{E}[\mathrm{v}_2(n)\mathrm{v}_1(n)] + 0 \]
\[ \phantom{p_2(0) :} = \mathbb{E}[(a_2\mathrm{v}_2(n-1) + \eta_n)(a_1\mathrm{v}_1(n-1) + \eta_n)]
   = a_2 a_1 p_2(0) + \sigma_\eta^2, \]

or

\[ p_2(0) = \frac{\sigma_\eta^2}{1 - a_2 a_1}. \tag{4.56} \]

We used the fact that $\mathbb{E}[\mathrm{v}_2(n-1)\eta_n] = \mathbb{E}[\mathrm{v}_1(n-1)\eta_n] = 0$, because
$\mathrm{v}_2(n-1)$ and $\mathrm{v}_1(n-1)$ depend recursively on previous values, i.e., $\eta(n-1), \eta(n-2), \ldots$,
and also that $\eta_n$ is a white noise sequence, hence the respective correlation values are zero. Also, due
to stationarity, $\mathbb{E}[\mathrm{v}_2(n)\mathrm{v}_1(n)] = \mathbb{E}[\mathrm{v}_2(n-1)\mathrm{v}_1(n-1)]$.

For the other value of the cross-correlation vector we have

\[ p_2(1) = \mathbb{E}[\mathrm{v}_2(n-1)\mathrm{d}_n] = \mathbb{E}[\mathrm{v}_2(n-1)(\mathrm{s}_n + \mathrm{v}_1(n))]
   = \mathbb{E}[\mathrm{v}_2(n-1)\mathrm{v}_1(n)] + 0 \]
\[ \phantom{p_2(1)} = \mathbb{E}[\mathrm{v}_2(n-1)(a_1\mathrm{v}_1(n-1) + \eta_n)] = a_1 p_2(0)
   = \frac{a_1\sigma_\eta^2}{1 - a_1 a_2}. \]

In general, it is easy to show that

\[ p_2(k) = \frac{a_1^k \sigma_\eta^2}{1 - a_2 a_1}, \quad k = 0, 1, \ldots. \tag{4.57} \]

Recall that because the processes are real-valued, the covariance matrix is symmetric, meaning
$r_2(k) = r_2(-k)$. Also, for (4.55) to make sense (recall that $r_2(0) > 0$), $|a_2| < 1$. The same holds true
for $|a_1|$, following similar arguments for the autocorrelation process of $\mathrm{v}_1(n)$.

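Thus, the optimal weights of the noise canceler are given by the following set of normal equations:

\[ \begin{bmatrix} \dfrac{\sigma_\eta^2}{1-a_2^2} & \dfrac{a_2\sigma_\eta^2}{1-a_2^2} \\[2mm]
                   \dfrac{a_2\sigma_\eta^2}{1-a_2^2} & \dfrac{\sigma_\eta^2}{1-a_2^2} \end{bmatrix} \mathbf{w}_*
   = \begin{bmatrix} \dfrac{\sigma_\eta^2}{1-a_1 a_2} \\[2mm] \dfrac{a_1\sigma_\eta^2}{1-a_1 a_2} \end{bmatrix}. \]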

Note that the canceler optimally “removes” from the mixture, $\mathrm{s}_n + \mathrm{v}_1(n)$, the component that
is correlated to the input, $\mathrm{v}_2(n)$; observe that $\mathrm{v}_1(n)$ basically acts as the desired response.

Fig. 4.14A shows a realization of the signal $d_n = s_n + v_1(n)$, where $s_n = \cos(\omega_0 n)$ with
$\omega_0 = 2 \times 10^{-3}\pi$, $a_1 = 0.8$, and $\sigma_\eta^2 = 0.05$. Fig. 4.14B is the respective
realization of the signal $s_n + v_1(n) - \hat{d}(n)$ for $a_2 = 0.75$. The corresponding weights for the
canceler are $\mathbf{w}_* = [1, 0.125]^T$. Fig. 4.14C corresponds to $a_2 = 0.5$. Observe that the higher the
cross-correlation between $\mathrm{v}_1(n)$ and $\mathrm{v}_2(n)$, the better the obtained result becomes.

FIGURE 4.14

(A) The noisy sinusoid signal of Example 4.1. (B) The denoised signal for strongly correlated noise
sources, $\mathrm{v}_1$ and $\mathrm{v}_2$. (C) The obtained denoised signal for less correlated noise sources.
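
As a quick numerical check of Example 4.1, the normal equations can be assembled from (4.55) and (4.57)
and solved directly. The function name `noise_canceler_weights` is hypothetical and used only for this
sketch; the parameter values are those quoted in the example.

```python
import numpy as np

def noise_canceler_weights(a1, a2, sigma_eta2):
    """Solve Sigma_2 w = p_2 for the two-tap noise canceler of Example 4.1."""
    # Covariance matrix of the input v_2(n), from (4.55): r_2(k) = a2^k sigma^2 / (1 - a2^2).
    Sigma_2 = sigma_eta2 / (1 - a2 ** 2) * np.array([[1.0, a2], [a2, 1.0]])
    # Cross-correlation vector, from (4.57): p_2(k) = a1^k sigma^2 / (1 - a1 a2).
    p_2 = sigma_eta2 / (1 - a1 * a2) * np.array([1.0, a1])
    return np.linalg.solve(Sigma_2, p_2)

w_strong = noise_canceler_weights(a1=0.8, a2=0.75, sigma_eta2=0.05)  # setting of Fig. 4.14B
w_weak = noise_canceler_weights(a1=0.8, a2=0.5, sigma_eta2=0.05)     # setting of Fig. 4.14C
print(w_strong)  # approximately [1, 0.125], as quoted in the text
```

Solving the system in closed form gives $\mathbf{w}_* = [1, (a_1 - a_2)/(1 - a_1 a_2)]^T$, so the weights do not
depend on $\sigma_\eta^2$.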

Example 4.2 (Channel equalization). Consider the channel equalization setup in Fig. 4.12, where the
output of the channel, which is sensed by the receiver, is given by

\[ \mathrm{u}_n = 0.5\,\mathrm{s}_n + \mathrm{s}_{n-1} + \eta_n. \tag{4.58} \]

The goal is to design an equalizer comprising three taps, $\mathbf{w} = [w_0, w_1, w_2]^T$, so that

\[ \hat{\mathrm{d}}_n = \mathbf{w}^T \mathbf{u}_n, \]

and estimate the unknown taps using as a desired response sequence $\mathrm{d}_n = \mathrm{s}_{n-1}$. We are given
that $\mathbb{E}[\mathrm{s}_n] = \mathbb{E}[\eta_n] = 0$ and

\[ \Sigma_s = \sigma_s^2 I, \quad \Sigma_\eta = \sigma_\eta^2 I. \]

FIGURE 4.15

(A) A realization of the information sequence comprising equiprobable, randomly generated, ±1 samples of
Example 4.2. (B) The corresponding sequence received at the receiver end. (C) The sequence at the output
of the equalizer for a low channel noise case. The original sequence is fully recovered with no errors.
(D) The output of the equalizer for high channel noise. The samples in gray are in error and of opposite
polarity compared to the originally transmitted samples.

Note that for the desired response we have used a delay $L = 1$. In order to better understand the
reason that a delay is used and without going into many details (for the more experienced reader, note
that the channel is nonminimum phase, e.g., [41]), observe that at time $n$, most of the contribution to
$\mathrm{u}_n$ in (4.58) comes from the symbol $\mathrm{s}_{n-1}$, which is weighted by one, while the sample
$\mathrm{s}_n$ is weighted by 0.5; hence, it is most natural from an intuitive point of view, at time $n$, having
received $\mathrm{u}_n$, to try to obtain an estimate for $\mathrm{s}_{n-1}$. This justifies the use of the delay.

Fig. 4.15A shows a realization of the input information sequence $s_n$. It consists of randomly
generated, equiprobable ±1 samples. The effect of the channel is (a) to combine successive information
samples together (ISI) and (b) to add noise; the purpose of the equalizer is to optimally remove both
of them. Fig. 4.15B shows the respective realization sequence of $\mathrm{u}_n$, which is received at the
receiver’s front end. Observe that, by looking at it, one cannot recognize in it the original sequence; the
noise together with the ISI has really changed its “look.”

Following a similar procedure as in the previous example, we obtain (Problem 4.14)

\[ \Sigma_u = \begin{bmatrix}
   1.25\sigma_s^2 + \sigma_\eta^2 & 0.5\sigma_s^2 & 0 \\
   0.5\sigma_s^2 & 1.25\sigma_s^2 + \sigma_\eta^2 & 0.5\sigma_s^2 \\
   0 & 0.5\sigma_s^2 & 1.25\sigma_s^2 + \sigma_\eta^2 \end{bmatrix}, \quad
   \mathbf{p} = \begin{bmatrix} \sigma_s^2 \\ 0.5\sigma_s^2 \\ 0 \end{bmatrix}. \]

Solving the normal equations,

\[ \Sigma_u \mathbf{w}_* = \mathbf{p}, \]

for $\sigma_s^2 = 1$ and $\sigma_\eta^2 = 0.01$, results in

\[ \mathbf{w}_* = [0.7462, 0.1195, -0.0474]^T. \]

Fig. 4.15C shows the recovered sequence by the equalizer ($\mathbf{w}_*^T \mathbf{u}_n$), after appropriate
thresholding. It is exactly the same as the transmitted one: no errors. Fig. 4.15D shows the recovered
sequence for the case where the variance of the noise was increased to $\sigma_\eta^2 = 1$. The corresponding
MSE optimal equalizer is equal to

\[ \mathbf{w}_* = [0.4132, 0.1369, -0.0304]^T. \]

This time, the sequence reconstructed by the equalizer has errors with respect to the transmitted one
(gray lines).

A slightly alternative formulation for obtaining $\Sigma_u$, instead of computing each one of its elements
individually, is the following. Verify that the input vector to the equalizer (with three taps) at time $n$
is given by

\[ \mathbf{u}_n = \begin{bmatrix} 0.5 & 1 & 0 & 0 \\ 0 & 0.5 & 1 & 0 \\ 0 & 0 & 0.5 & 1 \end{bmatrix}
   \begin{bmatrix} \mathrm{s}_n \\ \mathrm{s}_{n-1} \\ \mathrm{s}_{n-2} \\ \mathrm{s}_{n-3} \end{bmatrix}
   + \begin{bmatrix} \eta_n \\ \eta_{n-1} \\ \eta_{n-2} \end{bmatrix}
   := H\mathbf{s}_n + \boldsymbol{\eta}_n, \tag{4.59} \]

which results in

\[ \Sigma_u = \mathbb{E}[\mathbf{u}_n \mathbf{u}_n^T] = H\sigma_s^2 H^T + \Sigma_\eta
   = \sigma_s^2 HH^T + \sigma_\eta^2 I. \]

The reader can easily verify that this is the same as before. Note, however, that (4.59) reminds us of
the linear regression model. Moreover, note the special structure of the matrix $H$. Such matrices are
also known as convolution matrices. This structure is imposed by the fact that the elements of
$\mathbf{u}_n$ are time-shifted versions of the first element, because the input vector corresponds to a random
process. This is exactly the property that will be exploited next to derive efficient schemes for the
solution of the normal equations.
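
The convolution-matrix formulation of Example 4.2 lends itself to a short numerical sketch. The function
name `equalizer_taps` is an assumption made for illustration. The cross-correlation vector is obtained as
$\mathbf{p} = \mathbb{E}[\mathbf{u}_n \mathrm{s}_{n-1}] = \sigma_s^2 H \mathbf{e}_2$, i.e., $\sigma_s^2$ times the second
column of $H$, which follows from (4.59) and the independence of signal and noise.

```python
import numpy as np

def equalizer_taps(sigma_s2, sigma_eta2):
    """MSE optimal three-tap equalizer for the channel u_n = 0.5 s_n + s_{n-1} + eta_n."""
    # Convolution matrix H of (4.59): each row is a shifted copy of the channel taps.
    H = np.array([[0.5, 1.0, 0.0, 0.0],
                  [0.0, 0.5, 1.0, 0.0],
                  [0.0, 0.0, 0.5, 1.0]])
    Sigma_u = sigma_s2 * (H @ H.T) + sigma_eta2 * np.eye(3)
    p = sigma_s2 * H[:, 1]          # E[u_n s_{n-1}], the desired response being s_{n-1}
    return np.linalg.solve(Sigma_u, p), Sigma_u

w_low, Sigma_low = equalizer_taps(sigma_s2=1.0, sigma_eta2=0.01)   # low channel noise
w_high, Sigma_high = equalizer_taps(sigma_s2=1.0, sigma_eta2=1.0)  # high channel noise
print(np.round(w_low, 4))
```

The matrix `Sigma_low` reproduces the tridiagonal $\Sigma_u$ computed element by element in the example,
with 1.26 on the diagonal and 0.5 next to it.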


4.8 ALGORITHMIC ASPECTS: THE LEVINSON AND LATTICE-LADDER 
ALGORITHMS 

The goal of this section is to present algorithmic schemes for the efficient solution of the normal
equations in (4.16). The filtering case where the input and output entities are random processes will
be considered. In this case, we have already pointed out that the input covariance matrix has a special
structure. The main concepts to be presented here have a generality that goes beyond the specific form
of the normal equations. A vast literature concerning efficient (fast) algorithms for the least-squares
task as well as a number of its online/adaptive versions have their roots in the schemes to be presented
here. At the heart of all these schemes lies the specific structure of the input vector, whose elements
are time-shifted versions of its first element, $\mathrm{u}_n$.

Recall from linear algebra that in order to solve a general linear system of $l$ equations with $l$
unknowns, one requires $O(l^3)$ operations (multiplications and additions (MADS)). Exploiting the rich
structure of the autocorrelation/covariance matrix, associated with random processes, an algorithm
with $O(l^2)$ operations will be derived. The more general complex-valued case will be considered.

The autocorrelation/covariance matrix (for zero mean processes) of the input random vector has
been defined in (4.17). That is, it is Hermitian as well as positive semidefinite. From now on, we
will assume that it is positive definite. The covariance matrix in $\mathbb{C}^{m \times m}$, associated with a
complex wide-sense stationary process, is given by


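\[ \Sigma_m = \begin{bmatrix}
   r(0) & r(1) & \cdots & r(m-1) \\
   r(-1) & r(0) & \cdots & r(m-2) \\
   \vdots & \vdots & \ddots & \vdots \\
   r(-m+1) & r(-m+2) & \cdots & r(0) \end{bmatrix}
 = \begin{bmatrix}
   r(0) & r(1) & \cdots & r(m-1) \\
   r^*(1) & r(0) & \cdots & r(m-2) \\
   \vdots & \vdots & \ddots & \vdots \\
   r^*(m-1) & r^*(m-2) & \cdots & r(0) \end{bmatrix}, \]

where the property

\[ r(i) := \mathbb{E}[\mathrm{u}_n \mathrm{u}_{n-i}^*] = \mathbb{E}[(\mathrm{u}_{n-i}\mathrm{u}_n^*)^*] := r^*(-i) \]

has been used. We have relaxed the notational dependence of $\Sigma$ on $\mathbf{u}$ and we have instead
explicitly indicated the order of the matrix, because this will be a very useful index from now on.

We will follow a recursive approach, and our aim will be to express the optimal filter solution of
order $m$, denoted from now on as $\mathbf{w}_m$, in terms of the optimal one, $\mathbf{w}_{m-1}$, of order $m-1$.

The covariance matrix of a wide-sense stationary process is a Toeplitz matrix; all the elements
along any of its diagonals are equal. This property together with its Hermitian nature gives rise to the
following nested structure:

\[ \Sigma_m = \begin{bmatrix} \Sigma_{m-1} & J_{m-1}\mathbf{r}_{m-1} \\ \mathbf{r}_{m-1}^H J_{m-1} & r(0) \end{bmatrix}
            = \begin{bmatrix} r(0) & \mathbf{r}_{m-1}^T \\ \mathbf{r}_{m-1}^* & \Sigma_{m-1} \end{bmatrix}, \tag{4.60} \]

where

\[ \mathbf{r}_{m-1} := \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(m-1) \end{bmatrix} \tag{4.61} \]

and $J_{m-1}$ is the antidiagonal matrix of dimension $(m-1) \times (m-1)$, defined as

\[ J_{m-1} := \begin{bmatrix}
   0 & 0 & \cdots & 1 \\
   0 & \cdots & 1 & 0 \\
   \vdots & \iddots & & \vdots \\
   1 & 0 & \cdots & 0 \end{bmatrix}. \tag{4.62} \]

Note that right-multiplication of any matrix by $J_{m-1}$ reverses the order of its columns, while
multiplying it from the left reverses the order of the rows as follows:

\[ \mathbf{r}_{m-1}^H J_{m-1} = [r^*(m-1) \;\; r^*(m-2) \;\; \cdots \;\; r^*(1)] \]

and

\[ J_{m-1}\mathbf{r}_{m-1} = [r(m-1) \;\; r(m-2) \;\; \cdots \;\; r(1)]^T. \]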

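Applying the matrix inversion lemma from Appendix A.1 for the upper partition in (4.60), we obtain

\[ \Sigma_m^{-1} = \begin{bmatrix} \Sigma_{m-1}^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix}
   + \frac{1}{\alpha_{m-1}^b}
   \begin{bmatrix} -\Sigma_{m-1}^{-1} J_{m-1}\mathbf{r}_{m-1} \\ 1 \end{bmatrix}
   \begin{bmatrix} -\mathbf{r}_{m-1}^H J_{m-1}\Sigma_{m-1}^{-1} & 1 \end{bmatrix}, \tag{4.63} \]

where for this case the so-called Schur complement is the scalar

\[ \alpha_{m-1}^b = r(0) - \mathbf{r}_{m-1}^H J_{m-1}\Sigma_{m-1}^{-1} J_{m-1}\mathbf{r}_{m-1}. \tag{4.64} \]

The cross-correlation vector of order $m$, $\mathbf{p}_m$, admits the following partition:

\[ \mathbf{p}_m = \begin{bmatrix} \mathbb{E}[\mathrm{u}_n \mathrm{d}_n^*] \\ \vdots \\
   \mathbb{E}[\mathrm{u}_{n-m+2}\mathrm{d}_n^*] \\ \mathbb{E}[\mathrm{u}_{n-m+1}\mathrm{d}_n^*] \end{bmatrix}
   = \begin{bmatrix} \mathbf{p}_{m-1} \\ p_{m-1} \end{bmatrix}, \quad
   \text{where } p_{m-1} := \mathbb{E}[\mathrm{u}_{n-m+1}\mathrm{d}_n^*]. \tag{4.65} \]

Combining (4.63) and (4.65), the following elegant relation results:

\[ \mathbf{w}_m := \Sigma_m^{-1}\mathbf{p}_m = \begin{bmatrix} \mathbf{w}_{m-1} \\ 0 \end{bmatrix}
   + \begin{bmatrix} -\mathbf{b}_{m-1} \\ 1 \end{bmatrix} k_{m-1}^w, \tag{4.66} \]

where

\[ \mathbf{w}_{m-1} = \Sigma_{m-1}^{-1}\mathbf{p}_{m-1}, \quad
   \mathbf{b}_{m-1} := \Sigma_{m-1}^{-1} J_{m-1}\mathbf{r}_{m-1}, \]

and

\[ k_{m-1}^w := \frac{p_{m-1} - \mathbf{r}_{m-1}^H J_{m-1}\mathbf{w}_{m-1}}{\alpha_{m-1}^b}. \tag{4.67} \]

Eq. (4.66) is an order recursion that relates the optimal solution $\mathbf{w}_m$ with $\mathbf{w}_{m-1}$. In order
to obtain a complete recursive scheme, all one needs is a recursion for updating $\mathbf{b}_m$.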


FORWARD AND BACKWARD MSE OPTIMAL PREDICTORS 

Backward prediction: The vector $\mathbf{b}_m = \Sigma_m^{-1} J_m \mathbf{r}_m$ has an interesting physical
interpretation: it is the MSE optimal backward predictor of order $m$. That is, it is the linear filter which
optimally estimates/predicts the value of $\mathrm{u}_{n-m}$ given the values of
$\mathrm{u}_{n-m+1}, \mathrm{u}_{n-m+2}, \ldots, \mathrm{u}_n$. Thus, in order to design the optimal backward predictor
of order $m$, the desired response must be

\[ \mathrm{d}_n = \mathrm{u}_{n-m}, \]

and from the respective normal equations we get

\[ \mathbf{b}_m = \Sigma_m^{-1} \begin{bmatrix} \mathbb{E}[\mathrm{u}_n \mathrm{u}_{n-m}^*] \\
   \mathbb{E}[\mathrm{u}_{n-1}\mathrm{u}_{n-m}^*] \\ \vdots \\ \mathbb{E}[\mathrm{u}_{n-m+1}\mathrm{u}_{n-m}^*] \end{bmatrix}
   = \Sigma_m^{-1} J_m \mathbf{r}_m. \tag{4.68} \]

Hence, the MSE optimal backward predictor coincides with $\mathbf{b}_m$, i.e.,

\[ \mathbf{b}_m = \Sigma_m^{-1} J_m \mathbf{r}_m: \text{ MSE optimal backward predictor.} \]

FIGURE 4.16

The impulse response of the backward predictor is the conjugate reverse of that of the forward predictor.

Moreover, the corresponding minimum MSE, adapting (4.19) to our current needs, is equal to

\[ J(\mathbf{b}_m) = r(0) - \mathbf{r}_m^H J_m \Sigma_m^{-1} J_m \mathbf{r}_m = \alpha_m^b. \]

That is, the Schur complement in (4.64) is equal to the respective optimal MSE!

Forward prediction: The goal of the forward prediction task is to predict the value $\mathrm{u}_{n+1}$, given
the values $\mathrm{u}_n, \mathrm{u}_{n-1}, \ldots, \mathrm{u}_{n-m+1}$. Thus, the MSE optimal forward predictor of order
$m$, $\mathbf{a}_m$, is obtained by selecting the desired response $\mathrm{d}_n = \mathrm{u}_{n+1}$, and the respective
normal equations become

\[ \mathbf{a}_m = \Sigma_m^{-1} \begin{bmatrix} \mathbb{E}[\mathrm{u}_n \mathrm{u}_{n+1}^*] \\
   \mathbb{E}[\mathrm{u}_{n-1}\mathrm{u}_{n+1}^*] \\ \vdots \\ \mathbb{E}[\mathrm{u}_{n-m+1}\mathrm{u}_{n+1}^*] \end{bmatrix}
   = \Sigma_m^{-1} \begin{bmatrix} r^*(1) \\ r^*(2) \\ \vdots \\ r^*(m) \end{bmatrix}, \tag{4.69} \]

or

\[ \mathbf{a}_m = \Sigma_m^{-1}\mathbf{r}_m^*: \text{ MSE optimal forward predictor.} \tag{4.70} \]

From (4.70), it is not difficult to show (Problem 4.16) that (recall that $J_m J_m = I_m$)

\[ \mathbf{a}_m = J_m \mathbf{b}_m^*, \quad \mathbf{b}_m = J_m \mathbf{a}_m^* \tag{4.71} \]

and that the optimal MSE for the forward prediction, $J(\mathbf{a}_m) := \alpha_m^f$, is equal to that for the
backward prediction, i.e.,

\[ J(\mathbf{a}_m) = \alpha_m^f = \alpha_m^b = J(\mathbf{b}_m). \]

Fig. 4.16 depicts the two prediction tasks. In other words, the optimal forward predictor is the conjugate
reverse of the backward one, so that

\[ \mathbf{a}_m := \begin{bmatrix} a_m(0) \\ \vdots \\ a_m(m-1) \end{bmatrix}
   = J_m \mathbf{b}_m^* := \begin{bmatrix} b_m^*(m-1) \\ \vdots \\ b_m^*(0) \end{bmatrix}. \]

This property is due to the stationarity of the involved process. Because the statistical properties only
depend on the difference of the time instants, forward and backward predictions are not much different;
in both cases, given a set of samples, $\mathrm{u}_{n-m+1}, \ldots, \mathrm{u}_n$, we predict one sample ahead in the
future ($\mathrm{u}_{n+1}$ in the forward prediction) or one sample back in the past ($\mathrm{u}_{n-m}$ in the
backward prediction).

Having established the relationship between $\mathbf{a}_m$ and $\mathbf{b}_m$ in (4.71), we are ready to complete
the missing step in (4.66); that is, to complete an order recursive step for the update of $\mathbf{b}_m$.
Since (4.66) holds true for any desired response $\mathrm{d}_n$, it also applies for the special case where the
optimal filter to be designed is the forward predictor $\mathbf{a}_m$; in this case, $\mathrm{d}_n = \mathrm{u}_{n+1}$.
Replacing in (4.66) $\mathbf{w}_m$ ($\mathbf{w}_{m-1}$) with $\mathbf{a}_m$ ($\mathbf{a}_{m-1}$) results in

\[ \mathbf{a}_m = \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix}
   + \begin{bmatrix} -J_{m-1}\mathbf{a}_{m-1}^* \\ 1 \end{bmatrix} k_{m-1}, \tag{4.72} \]

where (4.71) has been used and

\[ k_{m-1} = \frac{r^*(m) - \mathbf{r}_{m-1}^H J_{m-1}\mathbf{a}_{m-1}}{\alpha_{m-1}^b}. \tag{4.73} \]

Combining (4.66), (4.67), (4.71), (4.72), and (4.73), the following algorithm, known as Levinson’s
algorithm, for the solution of the normal equations results.


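Algorithm 4.1 (Levinson’s algorithm).

• Input

- $r(0), r(1), \ldots, r(l)$

- $p_k = \mathbb{E}[\mathrm{u}_{n-k}\mathrm{d}_n^*]$, $k = 0, 1, \ldots, l-1$

• Initialize

- $w_1 = \dfrac{p_0}{r(0)}$, $a_1 = \dfrac{r^*(1)}{r(0)}$, $\alpha_1^b = r(0) - \dfrac{|r(1)|^2}{r(0)}$

- $k_1^w = \dfrac{p_1 - r^*(1) w_1}{\alpha_1^b}$, $k_1 = \dfrac{r^*(2) - r^*(1) a_1}{\alpha_1^b}$

• For $m = 2, \ldots, l-1$, Do

- $\mathbf{w}_m = \begin{bmatrix} \mathbf{w}_{m-1} \\ 0 \end{bmatrix} + \begin{bmatrix} -J_{m-1}\mathbf{a}_{m-1}^* \\ 1 \end{bmatrix} k_{m-1}^w$

- $\mathbf{a}_m = \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} + \begin{bmatrix} -J_{m-1}\mathbf{a}_{m-1}^* \\ 1 \end{bmatrix} k_{m-1}$

- $\alpha_m^b = \alpha_{m-1}^b (1 - |k_{m-1}|^2)$

- $k_m^w = \dfrac{p_m - \mathbf{r}_m^H J_m \mathbf{w}_m}{\alpha_m^b}$

- $k_m = \dfrac{r^*(m+1) - \mathbf{r}_m^H J_m \mathbf{a}_m}{\alpha_m^b}$

• End For

Note that the update for $\alpha_m^b$ is a direct consequence of its definition in (4.64) and (4.72)
(Problem 4.17). Also note that $\alpha_m^b > 0$ implies that $|k_m| < 1$.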

Remarks 4.3. 

• The complexity per order recursion is $4m$ MADS, hence for a system with $l$ equations this amounts
to $2l^2$ MADS. This computational saving is substantial compared to the $O(l^3)$ MADS required
by adopting a general purpose scheme. The previous very elegant scheme was proposed in 1947 by
Levinson [26]. A formulation of the algorithm was also independently proposed by Durbin [12]; the
algorithm is sometimes called the Levinson-Durbin algorithm. In [11], it was shown that Levinson’s
algorithm is redundant in its prediction part and the split Levinson algorithm was developed, whose
recursions evolve around symmetric vector quantities leading to further computational savings.
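
The order recursions (4.66), (4.67), (4.72), and (4.73) can be put together in a compact sketch. This is an
illustrative implementation under stated assumptions, not the book's reference code: the function names
`levinson` and `hermitian_toeplitz` are hypothetical, the covariance matrix is taken to be
$\Sigma_l[i,j] = r(j-i)$ with $r(-k) = r^*(k)$, and the loop is run up to the full order $l$ so that the
returned vector solves $\Sigma_l \mathbf{w}_l = \mathbf{p}_l$. Only $r(0), \ldots, r(l-1)$ are needed for that.

```python
import numpy as np

def levinson(r, p):
    """Solve Sigma_l w = p for a Hermitian Toeplitz Sigma_l with first row r(0), ..., r(l-1)."""
    r = np.asarray(r, dtype=complex)
    p = np.asarray(p, dtype=complex)
    l = len(p)
    w = np.array([p[0] / r[0]])                   # w_1
    if l == 1:
        return w
    a = np.array([np.conj(r[1]) / r[0]])          # a_1, forward predictor of order 1
    alpha = (r[0] - abs(r[1]) ** 2 / r[0]).real   # alpha_1^b
    for m in range(1, l):                         # step from order m to order m + 1
        rHJ = np.conj(r[m:0:-1])                  # row vector r_m^H J_m = [r*(m), ..., r*(1)]
        b = np.conj(a[::-1])                      # backward predictor b_m = J_m a_m^*, (4.71)
        kw = (p[m] - rHJ @ w) / alpha             # k_m^w, (4.67)
        w = np.append(w, 0) + np.append(-b, 1) * kw         # (4.66)
        if m < l - 1:
            k = (np.conj(r[m + 1]) - rHJ @ a) / alpha       # k_m, (4.73)
            a = np.append(a, 0) + np.append(-b, 1) * k      # (4.72)
            alpha *= 1 - abs(k) ** 2                        # alpha_{m+1}^b
    return w

def hermitian_toeplitz(r):
    """Build Sigma[i, j] = r(j - i), with r(-k) = conj(r(k)); used here only for checking."""
    r = np.asarray(r, dtype=complex)
    idx = np.arange(len(r))
    D = idx[None, :] - idx[:, None]
    return np.where(D >= 0, r[np.abs(D)], np.conj(r[np.abs(D)]))

# Hypothetical test data: a valid (positive definite) complex autocorrelation sequence.
r = 0.9 ** np.arange(5) * np.exp(0.3j * np.arange(5))
p = np.array([1.0, 0.5j, -0.2, 0.1, 0.3 - 0.1j])
w = levinson(r, p)
```

The result can be compared against a general purpose solver, e.g.,
`np.linalg.solve(hermitian_toeplitz(r), p)`, which costs $O(l^3)$ operations instead of $O(l^2)$.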

4.8.1 THE LATTICE-LADDER SCHEME 

So far, we have been involved with the so-called transversal implementation of a linear time-invariant
FIR filter; in other words, the output is expressed as a convolution between the impulse response and
the input of the linear structure. Levinson’s algorithm provided a computationally efficient scheme for
obtaining the MSE optimal estimate $\mathbf{w}_*$. We now turn our attention to an equivalent implementation of
the corresponding linear filter, which comes as a direct consequence of Levinson’s algorithm.

Define the error signals associated with the $m$th-order optimal forward and backward predictors at
time instant $n$ as

\[ \mathrm{e}_m^f(n) := \mathrm{u}_n - \mathbf{a}_m^H \mathbf{u}_m(n-1), \tag{4.74} \]

where $\mathbf{u}_m(n)$ is the input random vector of the $m$th-order filter, and the order of the filter has been
explicitly brought into the notation.7 The backward error is given by

\[ \mathrm{e}_m^b(n) := \mathrm{u}_{n-m} - \mathbf{b}_m^H \mathbf{u}_m(n)
   = \mathrm{u}_{n-m} - \mathbf{a}_m^T J_m \mathbf{u}_m(n). \tag{4.75} \]

Employing (4.75), the order recursion in (4.72), and the partitioning of $\mathbf{u}_m(n)$, which is
represented by

\[ \mathbf{u}_m(n) = [\mathbf{u}_{m-1}^T(n), \mathrm{u}_{n-m+1}]^T = [\mathrm{u}_n, \mathbf{u}_{m-1}^T(n-1)]^T, \tag{4.76} \]

in (4.74), we readily obtain

\[ \mathrm{e}_m^f(n) = \mathrm{e}_{m-1}^f(n) - \mathrm{e}_{m-1}^b(n-1)\, k_{m-1}^*, \quad m = 1, 2, \ldots, l, \tag{4.77} \]
\[ \mathrm{e}_m^b(n) = \mathrm{e}_{m-1}^b(n-1) - \mathrm{e}_{m-1}^f(n)\, k_{m-1}, \quad m = 1, 2, \ldots, l, \tag{4.78} \]

with $\mathrm{e}_0^f(n) = \mathrm{e}_0^b(n) = \mathrm{u}_n$ and $k_0 = \dfrac{r^*(1)}{r(0)}$. This pair of recursions is
known as lattice recursions. Let us focus a bit more on this set of equations.

7 The time index is now given in parentheses, to avoid having double subscripts.

Orthogonality of the Optimal Backward Errors 

From the vector space interpretation of random signals, it is apparent that $\mathrm{e}_m^b(n)$ lies in the
subspace spanned by $\mathrm{u}_{n-m}, \ldots, \mathrm{u}_n$, and we can write

\[ \mathrm{e}_m^b(n) \in \mathrm{span}\{\mathrm{u}(n-m), \ldots, \mathrm{u}(n)\}. \]

Moreover, because $\mathrm{e}_m^b(n)$ is the error associated with the MSE optimal backward predictor,
$\mathrm{e}_m^b(n) \perp \mathrm{span}\{\mathrm{u}(n-m+1), \ldots, \mathrm{u}(n)\}$. However, the latter subspace is the
one where $\mathrm{e}_{m-k}^b(n)$, $k = 1, 2, \ldots, m$, lie. Hence, for $m = 1, 2, \ldots, l-1$, we can write

\[ \mathrm{e}_m^b(n) \perp \mathrm{e}_k^b(n), \; k < m: \text{ orthogonality of the backward errors.} \]

Moreover, it is obvious that

\[ \mathrm{span}\{\mathrm{e}_0^b(n), \mathrm{e}_1^b(n), \ldots, \mathrm{e}_{l-1}^b(n)\}
   = \mathrm{span}\{\mathrm{u}_n, \mathrm{u}_{n-1}, \ldots, \mathrm{u}_{n-l+1}\}. \]

Hence, the normalized vectors

\[ \tilde{\mathrm{e}}_m^b(n) := \frac{\mathrm{e}_m^b(n)}{\|\mathrm{e}_m^b(n)\|}, \quad m = 0, 1, \ldots, l-1:
   \text{ orthonormal basis} \]

form an orthonormal basis in $\mathrm{span}\{\mathrm{u}_n, \mathrm{u}_{n-1}, \ldots, \mathrm{u}_{n-l+1}\}$ (see Fig. 4.17).
As a matter of fact, the pair of (4.77) and (4.78) comprises a Gram-Schmidt orthogonalizer [47].

FIGURE 4.17

The optimal backward errors form an orthogonal basis in the respective input random signal space.

Let us now express $\hat{\mathrm{d}}_n$, or the projection of $\mathrm{d}_n$ in
$\mathrm{span}\{\mathrm{u}_n, \ldots, \mathrm{u}_{n-l+1}\}$, in terms of the new set of orthogonal vectors,

\[ \hat{\mathrm{d}}_n = \sum_{m=0}^{l-1} h_m \mathrm{e}_m^b(n), \tag{4.79} \]

where the coefficients $h_m$ are given by

\[ h_m = \left\langle \hat{\mathrm{d}}_n, \frac{\mathrm{e}_m^b(n)}{\|\mathrm{e}_m^b(n)\|^2} \right\rangle
   = \frac{\mathbb{E}[\hat{\mathrm{d}}_n \mathrm{e}_m^{b*}(n)]}{\|\mathrm{e}_m^b(n)\|^2}
   = \frac{\mathbb{E}[(\mathrm{d}_n - \mathrm{e}_n)\mathrm{e}_m^{b*}(n)]}{\|\mathrm{e}_m^b(n)\|^2}
   = \frac{\mathbb{E}[\mathrm{d}_n \mathrm{e}_m^{b*}(n)]}{\|\mathrm{e}_m^b(n)\|^2}, \tag{4.80} \]

where the orthogonality of the error $\mathrm{e}_n$, with the subspace spanned by the backward errors, has been
taken into account. From (4.67) and (4.80), and taking into account the respective definitions of the
involved quantities, we readily obtain

\[ h_m = k_m^{w*}. \]

That is, the coefficients $k_m^w$, $m = 0, 1, \ldots, l-1$, in Levinson’s algorithm are the parameters in the
expansion of $\hat{\mathrm{d}}_n$ in terms of the orthogonal basis. Combining (4.77)-(4.79) the lattice-ladder
scheme of Fig. 4.18 results, whose output is the MSE approximation $\hat{\mathrm{d}}_n$ of $\mathrm{d}_n$.


Remarks 4.4. 


• The lattice-ladder scheme is a highly efficient, modular structure. It comprises a sequence of
successive similar stages. To increase the order of the filter, it suffices to add an extra stage, which
is a highly desirable property in VLSI implementations. Moreover, lattice-ladder schemes enjoy a
higher robustness, compared to Levinson’s algorithm, with respect to numerical inaccuracies.


FIGURE 4.18

The lattice-ladder structure. In contrast to the transversal implementation in terms of $\mathbf{w}_m$, the
parameterization is now in terms of $k_m$, $k_m^w$, $m = 0, 1, \ldots, l-1$, $k_0 = \dfrac{r^*(1)}{r(0)}$, and
$k_0^w = \dfrac{p_0^*}{r(0)}$. Note the resulting highly modular structure.

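• Cholesky factorization. The orthogonality property of the optimal MSE backward errors leads to
another interpretation of the involved parameters. From the definition in (4.75), we get

\[ \mathbf{e}_l^b(n) := \begin{bmatrix} \mathrm{e}_0^b(n) \\ \mathrm{e}_1^b(n) \\ \vdots \\ \mathrm{e}_{l-1}^b(n) \end{bmatrix}
   = U^H \begin{bmatrix} \mathrm{u}_n \\ \mathrm{u}_{n-1} \\ \vdots \\ \mathrm{u}_{n-l+1} \end{bmatrix}
   = U^H \mathbf{u}_l(n), \]

where

\[ U^H := \begin{bmatrix}
   1 & 0 & 0 & \cdots & 0 \\
   -a_1(0) & 1 & 0 & \cdots & 0 \\
   \vdots & \vdots & \ddots & \ddots & 0 \\
   -a_{l-1}(l-2) & -a_{l-1}(l-3) & \cdots & -a_{l-1}(0) & 1 \end{bmatrix} \]

and

\[ \mathbf{a}_m = [a_m(0), a_m(1), \ldots, a_m(m-1)]^T, \quad m = 1, 2, \ldots, l. \]

Due to the orthogonality of the involved backward errors,

\[ \mathbb{E}[\mathbf{e}_l^b(n)\mathbf{e}_l^{bH}(n)] = U^H \Sigma_l U = D, \tag{4.81} \]

where

\[ D := \mathrm{diag}\{\alpha_0^b, \alpha_1^b, \ldots, \alpha_{l-1}^b\}, \]

or

\[ \Sigma_l^{-1} = U D^{-1} U^H = (U D^{-1/2})(U D^{-1/2})^H. \]

That is, the prediction error powers and the weights of the optimal forward predictor provide the
Cholesky factorization of the inverse covariance matrix.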

• The Schur algorithm. In a parallel processing environment, the inner products involved in Levinson’s
algorithm pose a bottleneck in the flow of the algorithm. Note that the updates for $\mathbf{w}_m$ and
$\mathbf{a}_m$ can be performed fully in parallel. Schur’s algorithm [45] is an alternative scheme that overcomes
the bottleneck, and in a multiprocessor environment the complexity can go down to $O(l)$. The
parameters involved in Schur’s algorithm perform a Cholesky factorization of $\Sigma_l$ (e.g., [21,22]).

• Note that all these algorithmic schemes for the efficient solution of the normal equations owe their
existence to the rich structure that the (autocorrelation) covariance matrix and the cross-correlation
vector acquire when the involved jointly distributed random entities are random processes; their
time sequential nature imposes such a structure. The derivation of the Levinson and lattice-ladder
schemes reveals the flavor of the type of techniques that can be (and have extensively been) used to
derive computational schemes for the online/adaptive versions and the related least-squares error
loss function, to be discussed in Chapter 6. There, the algorithms may be computationally more
involved, but the essence behind them is the same as for those used in the current section.
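
The Cholesky interpretation in the remarks above can be checked numerically in the smallest
nontrivial case. The following is only a sketch for $l = 2$ and real-valued data; the lag values
`r0`, `r1` and the helper names are illustrative assumptions, not taken from the text. It builds
$U^H$ from the order-1 predictor coefficient $a_1(0) = r(1)/r(0)$ and verifies that $U^H \Sigma_l U$
is diagonal and that $U D^{-1} U^H$ inverts $\Sigma_l$.

```python
# Sketch: verify U^H Sigma U = D and Sigma^{-1} = U D^{-1} U^H for l = 2 (real case).

def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

r0, r1 = 1.0, 0.6                      # hypothetical autocorrelation lags r(0), r(1)
Sigma = [[r0, r1], [r1, r0]]           # 2x2 autocorrelation matrix

a1 = r1 / r0                           # order-1 predictor coefficient a_1(0)
UH = [[1.0, 0.0], [-a1, 1.0]]          # rows produce e_0^b(n) and e_1^b(n)
U = transpose(UH)

D = matmul(matmul(UH, Sigma), U)       # diagonal matrix of backward error powers
D_inv = [[1.0 / D[0][0], 0.0], [0.0, 1.0 / D[1][1]]]
Sigma_inv = matmul(matmul(U, D_inv), UH)
identity = matmul(Sigma, Sigma_inv)    # should reproduce the 2x2 identity
```

For larger orders, the rows of $U^H$ would come from a Levinson recursion, which this sketch does
not implement.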
4.9 MEAN-SQUARE ERROR ESTIMATION OF LINEAR MODELS

We now turn our attention to the case where the underlying model that relates the input-output variables
is a linear one. To prevent confusion with what was treated in the previous sections, it must be stressed
that, so far, we have been concerned with the linear estimation task. At no point in this stage of our
discussion has the generation model of the data been brought in (with the exception of the comment in
Remarks 4.2). We just adopted a linear estimator and obtained the MSE solution for it. The focus was
on the solution and its properties. The emphasis here is on cases where the input-output variables are
related via a linear data generation model.

Let us assume that we are given two jointly distributed random vectors, $y$ and $\theta$, which are
related according to the following linear model,

$$y = X\theta + \eta, \tag{4.82}$$

where $\eta$ denotes the set of the involved noise variables. Note that such a model covers the case of
our familiar regression task, where the unknown parameters $\theta$ are considered random, which is in
line with the Bayesian philosophy, as discussed in Chapter 3. Once more, we assume zero mean vectors;
otherwise, the respective mean values are subtracted. The dimensions of $y$ ($\eta$) and $\theta$ may
not necessarily be the same; to be in line with the notation used in Chapter 3, let
$y, \eta \in \mathbb{R}^N$ and $\theta \in \mathbb{R}^l$. Hence, $X$ is an $N \times l$ matrix. Note
that the matrix $X$ is considered to be deterministic and not random.

Assume the covariance matrices of our zero mean variables,

$$\Sigma_\theta = E[\theta\theta^T], \quad \Sigma_\eta = E[\eta\eta^T],$$

are known. The goal is to compute a matrix $H$ of dimension $l \times N$ so that the linear estimator

$$\hat\theta = Hy \tag{4.83}$$

minimizes the MSE cost

$$J(H) := E[(\theta - \hat\theta)^T(\theta - \hat\theta)] = \sum_{i=1}^{l} E[(\theta_i - \hat\theta_i)^2]. \tag{4.84}$$

Note that this is a multichannel estimation task and it is equivalent to solving $l$ optimization
tasks, one for each component, $\theta_i$, of $\theta$. Defining the error vector as

$$\epsilon := \theta - \hat\theta,$$

the cost function is equal to the trace of the corresponding error covariance matrix, so that

$$J(H) := \mathrm{trace}\{E[\epsilon\epsilon^T]\}.$$

Focusing on the $i$th component in (4.83), we write

$$\hat\theta_i = h_i^T y, \quad i = 1, 2, \ldots, l, \tag{4.85}$$

where $h_i^T$ is the $i$th row of $H$ and its optimal estimate is given by

$$h_{*,i} := \arg\min_{h_i} E[(\theta_i - \hat\theta_i)^2] = \arg\min_{h_i} E[(\theta_i - h_i^T y)^2]. \tag{4.86}$$

Minimizing (4.86) is exactly the same task as that of the linear estimation considered in the previous
section (with $y$ in place of $x$ and $\theta_i$ in place of $y$); hence,

$$\Sigma_y h_{*,i} = p_i, \quad i = 1, 2, \ldots, l,$$

where

$$\Sigma_y = E[yy^T] \quad \text{and} \quad p_i = E[y\theta_i], \quad i = 1, 2, \ldots, l,$$

or

$$h_{*,i}^T = p_i^T \Sigma_y^{-1}, \quad i = 1, 2, \ldots, l,$$

and finally,

$$H_* = \Sigma_{y\theta}\Sigma_y^{-1}, \quad \hat\theta = \Sigma_{y\theta}\Sigma_y^{-1} y, \tag{4.87}$$

where

$$\Sigma_{y\theta} := \begin{bmatrix} p_1^T \\ p_2^T \\ \vdots \\ p_l^T \end{bmatrix} = E[\theta y^T] \tag{4.88}$$

is an $l \times N$ cross-correlation matrix. All that is now required is to compute $\Sigma_y$ and
$\Sigma_{y\theta}$. To this end,

$$\Sigma_y = E[yy^T] = E[(X\theta + \eta)(\theta^T X^T + \eta^T)] = X\Sigma_\theta X^T + \Sigma_\eta, \tag{4.89}$$

where the independence of the zero mean vectors $\theta$ and $\eta$ has been used. Similarly,

$$\Sigma_{y\theta} = E[\theta y^T] = E[\theta(\theta^T X^T + \eta^T)] = \Sigma_\theta X^T, \tag{4.90}$$

and combining (4.87), (4.89), and (4.90), we obtain

$$\hat\theta = \Sigma_\theta X^T (\Sigma_\eta + X\Sigma_\theta X^T)^{-1} y. \tag{4.91}$$

Employing from Appendix A.1 the matrix identity

$$(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}$$

in (4.91) we obtain

$$\hat\theta = (\Sigma_\theta^{-1} + X^T \Sigma_\eta^{-1} X)^{-1} X^T \Sigma_\eta^{-1} y: \quad \text{MSE linear estimator}. \tag{4.92}$$

In case of complex-valued variables, the only difference is that transposition is replaced by Hermitian
transposition.
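
The equivalence of the two forms (4.91) and (4.92) can be illustrated numerically. The sketch below
assumes a scalar parameter ($l = 1$), two observations ($N = 2$), and a diagonal noise covariance;
the function names and all numeric values are hypothetical choices made only for this illustration.

```python
# Sketch: compare (4.91) and (4.92) for l = 1, N = 2, Sigma_eta = diag(noise_var).

def estimate_4_92(sigma_theta, X, noise_var, y):
    """(Sigma_theta^{-1} + X^T Sigma_eta^{-1} X)^{-1} X^T Sigma_eta^{-1} y, all scalars."""
    precision = 1.0 / sigma_theta + sum(x * x / v for x, v in zip(X, noise_var))
    return sum(x * yi / v for x, yi, v in zip(X, y, noise_var)) / precision

def estimate_4_91(sigma_theta, X, noise_var, y):
    """Sigma_theta X^T (Sigma_eta + X Sigma_theta X^T)^{-1} y with an explicit 2x2 inverse."""
    a = noise_var[0] + sigma_theta * X[0] * X[0]
    b = sigma_theta * X[0] * X[1]
    d = noise_var[1] + sigma_theta * X[1] * X[1]
    det = a * d - b * b
    z0 = (d * y[0] - b * y[1]) / det     # first entry of Sigma_y^{-1} y
    z1 = (-b * y[0] + a * y[1]) / det    # second entry of Sigma_y^{-1} y
    return sigma_theta * (X[0] * z0 + X[1] * z1)

sigma_theta = 2.0          # prior variance of theta (assumed)
X = [1.0, 0.5]             # deterministic N x 1 input matrix (assumed)
noise_var = [0.1, 0.4]     # diagonal of Sigma_eta (assumed)
y = [1.2, 0.4]             # observed values (assumed)

theta_a = estimate_4_91(sigma_theta, X, noise_var, y)
theta_b = estimate_4_92(sigma_theta, X, noise_var, y)
```

Form (4.92) inverts an $l \times l$ matrix, whereas (4.91) inverts an $N \times N$ one, so (4.92) is
usually the cheaper choice when $N > l$.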

Remarks 4.5.

• Recall from Chapter 3 that the optimal MSE estimator of $\theta$ given the values of $y$ is
provided by

$$E[\theta|y].$$

However, as shown in Problem 3.16, if $\theta$ and $y$ are jointly Gaussian vectors, then the optimal
estimator is linear (affine for nonzero mean variables), and it coincides with the MSE linear estimator
of (4.92).

• If we allow nonzero mean values, then instead of (4.83) the affine model should be adopted,

$$\hat\theta = Hy + \mu.$$

Then

$$E[\hat\theta] = H E[y] + \mu \Rightarrow \mu = E[\hat\theta] - H E[y].$$

Hence,

$$\hat\theta = E[\hat\theta] + H(y - E[y]),$$

and finally,

$$\hat\theta - E[\hat\theta] = H(y - E[y]),$$

which justifies our approach to subtract the means and work with zero mean value variables. The
MSE-optimal choice of $\mu$ yields $E[\hat\theta] = E[\theta]$, so for nonzero mean values the analogue
of (4.92) is

$$\hat\theta = E[\theta] + (\Sigma_\theta^{-1} + X^T \Sigma_\eta^{-1} X)^{-1} X^T \Sigma_\eta^{-1}(y - E[y]). \tag{4.93}$$

Note that for zero mean noise $\eta$, $E[y] = X E[\theta]$.

• Compare (4.93) with (3.73) for the Bayesian inference approach. They are identical, provided that
the covariance matrix of the prior (Gaussian) PDF is equal to $\Sigma_\theta$ and
$\theta_0 = E[\theta]$ for a zero mean noise variable.


4.9.1 THE GAUSS-MARKOV THEOREM 

We now turn our attention to the case where $\theta$ in the regression model is considered to be an
(unknown) constant, instead of a random vector. Thus, the linear model is now written as

$$y = X\theta + \eta, \tag{4.94}$$

and the randomness of $y$ is solely due to $\eta$, which is assumed to be zero mean with covariance
matrix $\Sigma_\eta$. The goal is to design an unbiased linear estimator of $\theta$, which minimizes
the MSE,

$$\hat\theta = Hy, \tag{4.95}$$

and select $H$ such that

$$\text{minimize} \quad \mathrm{trace}\{E[(\theta - \hat\theta)(\theta - \hat\theta)^T]\}$$
$$\text{subject to} \quad E[\hat\theta] = \theta. \tag{4.96}$$

From (4.94) and (4.95), we get

$$E[\hat\theta] = H E[y] = H E[(X\theta + \eta)] = HX\theta,$$

which implies that the unbiasedness constraint is equivalent to

$$HX = I. \tag{4.97}$$

Employing (4.95), the error vector becomes

$$\epsilon = \theta - \hat\theta = \theta - Hy = \theta - H(X\theta + \eta) = -H\eta. \tag{4.98}$$

Hence, the constrained minimization in (4.96) can now be written as

$$H_* = \arg\min_H \mathrm{trace}\{H\Sigma_\eta H^T\},$$
$$\text{s.t.} \quad HX = I. \tag{4.99}$$

Solving (4.99) results in (Problem 4.18)

$$H_* = (X^T\Sigma_\eta^{-1}X)^{-1}X^T\Sigma_\eta^{-1}, \tag{4.100}$$

and the associated minimum MSE is

$$J(H_*) := \mathrm{MSE}(H_*) = \mathrm{trace}\{(X^T\Sigma_\eta^{-1}X)^{-1}\}. \tag{4.101}$$

The reader can verify that

$$J(H) \ge J(H_*)$$

for any other linear unbiased estimator (Problem 4.19).

The previous result is known as the Gauss-Markov theorem. The optimal MSE linear unbiased
estimator is given by

$$\hat\theta = (X^T\Sigma_\eta^{-1}X)^{-1}X^T\Sigma_\eta^{-1}y: \quad \text{BLUE}. \tag{4.102}$$

It is also known as the best linear unbiased estimator (BLUE), or the minimum variance unbiased
linear estimator. For complex-valued variables, the transposition is simply replaced by the Hermitian
one.

Remarks 4.6.

• For the BLUE to exist, $X^T\Sigma_\eta^{-1}X$ must be invertible. This is guaranteed if
$\Sigma_\eta$ is positive definite and the $N \times l$ matrix $X$, $N \ge l$, is full rank
(Problem 4.20).

• Observe that the BLUE coincides with the maximum likelihood estimator (Chapter 3) if $\eta$ follows
a multivariate Gaussian distribution; recall that under this assumption, the Cramer-Rao bound is
achieved. If this is not the case, there may be another unbiased estimator (nonlinear), which results
in lower MSE. Recall also from Chapter 3 that there may be a biased estimator that results in lower
MSE; see [13,38] and the references therein for a related discussion.
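
As a small illustration of the Gauss-Markov result, consider the special case of a scalar $\theta$
observed $N$ times, so that $X$ is a column of ones, with uncorrelated noise of unequal variances.
Under these assumptions (chosen here only for illustration), (4.100) reduces to inverse-variance
weighting and (4.101) to $1/\sum_i \sigma_i^{-2}$. The function names and noise variances below are
hypothetical.

```python
# Sketch: BLUE for y_i = theta + eta_i with Sigma_eta = diag(noise_var), X = [1, ..., 1]^T.

def blue_weights(noise_var):
    """Row vector H_* of (4.100): inverse-variance weights normalized to sum to one."""
    inv = [1.0 / v for v in noise_var]
    total = sum(inv)
    return [w / total for w in inv]

def estimator_variance(weights, noise_var):
    """MSE of an unbiased linear estimator: H Sigma_eta H^T for diagonal Sigma_eta."""
    return sum(w * w * v for w, v in zip(weights, noise_var))

noise_var = [0.5, 1.0, 4.0]                         # assumed noise variances
h_blue = blue_weights(noise_var)
h_mean = [1.0 / len(noise_var)] * len(noise_var)    # plain sample mean, also unbiased

mse_blue = estimator_variance(h_blue, noise_var)
mse_mean = estimator_variance(h_mean, noise_var)
```

Both weight vectors satisfy the constraint $HX = 1$, but the sample mean ignores the noise
variances and therefore ends up with the larger MSE.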

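Example 4.3 (Channel identification). The task is illustrated in Fig. 4.11. Assume that we have access
to a set of input-output observations, $u_n$ and $d_n$, $n = 0, 1, 2, \ldots, N-1$. Moreover, we are
given that the impulse response of the system comprises $l$ taps, it is zero mean, and its covariance
matrix is $\Sigma_w$. Also, the second-order statistics of the zero mean noise are known, and we are
given its covariance matrix, $\Sigma_\eta$. Then, assuming that the plant starts from zero initial
conditions, we can adopt the following model relating the involved random variables (in line with the
model in (4.82)):

$$d := \begin{bmatrix} d_0 \\ d_1 \\ \vdots \\ d_{l-1} \\ \vdots \\ d_{N-1} \end{bmatrix}
= U \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{l-1} \end{bmatrix}
+ \begin{bmatrix} \eta_0 \\ \eta_1 \\ \vdots \\ \eta_{l-1} \\ \vdots \\ \eta_{N-1} \end{bmatrix}, \tag{4.103}$$

where

$$U := \begin{bmatrix}
u_0 & 0 & 0 & \cdots & 0 \\
u_1 & u_0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
u_{l-1} & u_{l-2} & \cdots & \cdots & u_0 \\
\vdots & \vdots & \vdots & & \vdots \\
u_{N-1} & \cdots & \cdots & \cdots & u_{N-l}
\end{bmatrix}.$$

Note that $U$ is treated deterministically. Then, recalling (4.92) and plugging in the set of obtained
measurements, the following estimate results:

$$\hat w = (\Sigma_w^{-1} + U^T\Sigma_\eta^{-1}U)^{-1}U^T\Sigma_\eta^{-1}d. \tag{4.104}$$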
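
A minimal sketch of Example 4.3 follows, under simplifying assumptions that are not part of the
example itself: $l = 2$ taps, $\Sigma_w = \sigma_w^2 I$, $\Sigma_\eta = \sigma_\eta^2 I$, and
noise-free synthetic data so that the result can be checked. The helper names, the input sequence,
and the tap values are hypothetical.

```python
# Sketch: build U of (4.103) and evaluate (4.104) for l = 2 with isotropic covariances.

def convolution_matrix(u, l):
    """N x l lower-triangular Toeplitz matrix U (zero initial conditions)."""
    return [[u[n - k] if n - k >= 0 else 0.0 for k in range(l)] for n in range(len(u))]

def estimate_taps(u, d, prior_var, noise_var):
    """(4.104) for l = 2, Sigma_w = prior_var * I, Sigma_eta = noise_var * I."""
    U = convolution_matrix(u, 2)
    # A = Sigma_w^{-1} + U^T Sigma_eta^{-1} U (symmetric 2x2)
    a00 = 1.0 / prior_var + sum(r[0] * r[0] for r in U) / noise_var
    a01 = sum(r[0] * r[1] for r in U) / noise_var
    a11 = 1.0 / prior_var + sum(r[1] * r[1] for r in U) / noise_var
    # b = U^T Sigma_eta^{-1} d
    b0 = sum(r[0] * dn for r, dn in zip(U, d)) / noise_var
    b1 = sum(r[1] * dn for r, dn in zip(U, d)) / noise_var
    det = a00 * a11 - a01 * a01
    return [(a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]

u = [1.0, -0.5, 2.0, 0.3, -1.2, 0.8]       # assumed input samples u_0, ..., u_5
w_true = [0.7, -0.4]                       # assumed impulse response
d = [sum(row[k] * w_true[k] for k in range(2)) for row in convolution_matrix(u, 2)]

w_hat = estimate_taps(u, d, prior_var=1e6, noise_var=0.01)
```

With a very broad prior (large `prior_var`) and noise-free data, the estimate is expected to be
close to `w_true`; a tighter prior pulls the estimate toward zero, the assumed prior mean.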


4.9.2 CONSTRAINED LINEAR ESTIMATION: THE BEAMFORMING CASE

We have already dealt with a constrained linear estimation task in Section 4.9.1 in our effort to obtain
an unbiased estimator of a fixed-value parameter vector. In the current section, we will see that the
procedure developed there is readily applicable for cases where the unknown parameter vector is required
to respect certain linear constraints.

FIGURE 4.19

The task of the beamformer is to obtain estimates of the weights $w_0, \ldots, w_{l-1}$, so as to
minimize the effect of noise and at the same time impose a constraint that, in the absence of noise,
would leave signals impinging on the array from the desired angle, $\phi$, unaffected.

We will demonstrate such a constrained task in the context of beamforming. Fig. 4.19 illustrates the 
basic block diagram of the beamforming task. A beamformer comprises a set of antenna elements. We 
consider the case where the antenna elements are uniformly spaced along a straight line. The goal is to 
linearly combine the signals received by the individual antenna elements, so as to: 

• turn the main beam of the array to a specific direction in space, and 

• optimally reduce the noise. 

The first goal imposes a constraint to the designer, which will guarantee that the gain of the array is 
high for the specific desired direction; for the second goal, we will adopt MSE arguments. 

In a more formal way, assume that the transmitter is far enough away, so as to guarantee that the
wavefronts that the array “sees” are planar. Let $s(t)$ be the information random process transmitted
at a carrier frequency, $\omega_c$. Then the modulated signal is

$$r(t) = s(t)e^{j\omega_c t}.$$

If $\Delta x$ is the distance between successive elements of the array, then a wavefront that arrives
at time $t_0$ at the first element will reach the $i$th element delayed by

$$\Delta t_i = t_i - t_0 = i\frac{\Delta x \cos\phi}{c}, \quad i = 0, 1, \ldots, l-1,$$

where $c$ is the speed of propagation, $\phi$ is the angle formed by the array and the direction of
propagation of the wavefronts, and $l$ is the number of array elements. We know from our basic
electromagnetic courses that

$$c = \frac{\omega_c \lambda}{2\pi},$$

where $\lambda$ is the respective wavelength. Taking a snapshot at time $t$, the signal received from
direction $\phi$ at the $i$th element will be

$$r_i(t) = s(t - \Delta t_i)\, e^{j\omega_c\left(t - i\frac{2\pi\Delta x\cos\phi}{\omega_c\lambda}\right)}
\simeq s(t)\, e^{j\omega_c t}\, e^{-2\pi j\frac{i\Delta x\cos\phi}{\lambda}}, \quad i = 0, 1, \ldots, l-1,$$

where we have assumed a relatively low time signal variation. After converting the received signals in
the baseband (multiplying by $e^{-j\omega_c t}$), the vector of the received signals (one per array
element) at time $t$ can be written in the following linear regression-type formulation:

$$u(t) := \begin{bmatrix} u_0(t) \\ u_1(t) \\ \vdots \\ u_{l-1}(t) \end{bmatrix} = x\,s(t) + \eta(t), \tag{4.105}$$

where

$$x := \begin{bmatrix} 1 \\ e^{-2\pi j\frac{\Delta x\cos\phi}{\lambda}} \\ \vdots \\
e^{-2\pi j\frac{(l-1)\Delta x\cos\phi}{\lambda}} \end{bmatrix},$$

and the vector $\eta(t)$ contains the additive noise plus any other interference due to signals coming
from directions other than $\phi$, so that

$$\eta(t) = [\eta_0(t), \ldots, \eta_{l-1}(t)]^T,$$

and it is assumed to be of zero mean; $x$ is also known as the steering vector. The output of the
beamformer, acting on the input vector signal, will be

$$\hat s(t) = w^H u(t),$$

where the Hermitian transposition has to be used, because now the involved signals are
complex-valued.

We will first impose the constraint. Ideally, in the absence of noise, one would like to recover signals
that impinge on the array from the desired direction, $\phi$, exactly. Thus, $w$ should satisfy the
following constraint:

$$w^H x = 1, \tag{4.106}$$

which guarantees that $\hat s(t) = s(t)$ in the absence of noise. Note that (4.106) is an instance of
(4.97) if we consider $w^H$ and $x$ in place of $H$ and $X$, respectively. To account for the noise,
we require the MSE

$$E[|s(t) - \hat s(t)|^2] = E[|s(t) - w^H u(t)|^2]$$

to be minimized. However,

$$s(t) - w^H u(t) = s(t) - w^H(x\,s(t) + \eta(t)) = -w^H\eta(t).$$

Hence, the optimal $w_*$ results from the following constrained task:

$$w_* := \arg\min_w (w^H\Sigma_\eta w)$$
$$\text{s.t.} \quad w^H x = 1, \tag{4.107}$$

which is an instance of (4.99) and the solution is given by (4.100); adapting it to the current notation
and to its complex-valued formulation, we get

$$w_*^H = \frac{x^H\Sigma_\eta^{-1}}{x^H\Sigma_\eta^{-1}x} \tag{4.108}$$

and

$$\hat s(t) = w_*^H u(t) = \frac{x^H\Sigma_\eta^{-1}u(t)}{x^H\Sigma_\eta^{-1}x}. \tag{4.109}$$

The minimum MSE is equal to

$$\mathrm{MSE}(w_*) = \frac{1}{x^H\Sigma_\eta^{-1}x}. \tag{4.110}$$

An alternative formulation for the cost function in order to estimate the weights of the beamformer,
which is often met in practice, builds upon the goal to minimize the output power, subject to the same
constraint as before,

$$w_* := \arg\min_w E[|w^H u(t)|^2],$$
$$\text{s.t.} \quad w^H x = 1,$$

or equivalently

$$w_* := \arg\min_w w^H\Sigma_u w,$$
$$\text{s.t.} \quad w^H x = 1. \tag{4.111}$$

FIGURE 4.20

The amplitude beam pattern, in dBs, as a function of the angle $\phi$ with respect to the planar array.


This time, the beamformer is pushed to reduce its output signal, which, due to the presence of the
constraint, is equivalent to optimally minimizing the contributions originating from the noise as well
as from all other interference sources impinging on the array from directions other than $\phi$. The
resulting solution of (4.111) is obviously the same as (4.109) and (4.110) if one replaces
$\Sigma_\eta$ with $\Sigma_u$.

This type of linearly constrained task is known as linearly constrained minimum variance (LCMV)
or Capon beamforming or minimum variance distortionless response (MVDR) beamforming. For a
concise introduction to beamforming, see, e.g., [48].

Widely linear versions for the beamforming task have also been proposed (e.g., [10,32])
(Problem 4.21).

Fig. 4.20 shows the resulting beam pattern as a function of the angle $\phi$. The desired angle for
designing the optimal set of weights in (4.108) is $\phi = \pi$. The number of antenna elements is
$l = 10$, the spacing has been chosen as $\Delta x/\lambda = 0.5$, and the noise covariance matrix is
chosen as $0.1I$. The beam pattern amplitude is in dBs, meaning the vertical axis shows
$20\log_{10}(|w_*^H x(\phi)|)$. Thus, any signal arriving from directions $\phi$ not close to
$\phi = \pi$ will be attenuated. The main beam can become sharper if more elements are used.
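
The beam pattern computation can be sketched as follows. The setup mirrors the values quoted for
Fig. 4.20 ($l = 10$, noise covariance $0.1I$, desired angle $\pi$) and assumes that the spacing
value refers to the ratio $\Delta x/\lambda$; the function names and the small floor used to avoid
taking the logarithm of zero are choices made for this sketch only. For a white noise covariance,
(4.108) reduces to a scaled copy of the steering vector, so no matrix inversion is needed.

```python
import cmath
import math

def steering_vector(phi, l, spacing):
    """x(phi) for a uniform linear array; spacing is Delta x / lambda."""
    return [cmath.exp(-2j * math.pi * i * spacing * math.cos(phi)) for i in range(l)]

def mvdr_weights_white(x, noise_var):
    """w_* = Sigma^{-1} x / (x^H Sigma^{-1} x) for Sigma = noise_var * I, cf. (4.108)."""
    denom = sum(abs(xi) ** 2 for xi in x) / noise_var
    return [xi / noise_var / denom for xi in x]

def gain_db(w, x, floor=1e-12):
    """20 log10 |w^H x|, floored so that exact nulls do not raise an error."""
    response = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
    return 20.0 * math.log10(max(abs(response), floor))

l, spacing, noise_var = 10, 0.5, 0.1
w = mvdr_weights_white(steering_vector(math.pi, l, spacing), noise_var)

# Beam pattern sampled every 15 degrees between 90 and 180 degrees.
pattern = [(deg, gain_db(w, steering_vector(math.radians(deg), l, spacing)))
           for deg in range(90, 181, 15)]

# Output noise power w^H Sigma w, to be compared with (4.110).
output_noise_power = noise_var * sum(abs(wi) ** 2 for wi in w)
```

The distortionless constraint shows up as a gain of 0 dB at the desired angle, and the output noise
power equals $1/(x^H\Sigma_\eta^{-1}x)$, in agreement with (4.110).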


4.10 TIME-VARYING STATISTICS: KALMAN FILTERING 

So far, our discussion about the linear estimation task has been limited to stationary environments,
where the statistical properties of the involved random variables are assumed to be invariant with time.
However, very often in practice this is not the case, and the statistical properties may be different at
different time instants. As a matter of fact, a large effort in the subsequent chapters will be devoted to
studying the estimation task under time-varying environments.

Rudolf Kalman is the third scientist, the other two being Wiener and Kolmogorov, whose significant 
contributions laid the foundations of estimation theory. Kalman was Hungarian-born and emigrated to 
the United States. He is the father of what is today known as system theory based on the state-space 
formulation, as opposed to the more limited input-output description of systems. 

In 1960, in two seminal papers, Kalman proposed the celebrated Kalman filter, which exploits the
state-space formulation in order to accommodate in an elegant way time-varying dynamics [18,19].
We will derive the basic recursions of the Kalman filter in the general context of two jointly distributed
random vectors $y$, $x$. The task is to estimate the values of $x$ given observations of $y$. Let $y$
and $x$ be linearly related via the following set of recursions:

$$x_n = F_n x_{n-1} + \eta_n, \quad n \ge 0: \quad \text{state equation}, \tag{4.112}$$
$$y_n = H_n x_n + v_n, \quad n \ge 0: \quad \text{output equation}, \tag{4.113}$$

where $\eta_n, x_n \in \mathbb{R}^l$, $v_n, y_n \in \mathbb{R}^k$. The vector $x_n$ is known as the
state of the system at time $n$ and $y_n$ is the output, which is the vector that can be observed
(measured); $\eta_n$ and $v_n$ are the noise vectors, known as process noise and measurement noise,
respectively. Matrices $F_n$ and $H_n$ are of appropriate dimensions and they are assumed to be known.
Observe that the so-called state equation provides the information related to the time-varying dynamics
of the corresponding system. It turns out that a large number of real-world tasks can be brought into
the form of (4.112) and (4.113). The model is known as the state-space model for $y_n$. In order to
derive the time-varying estimator, $\hat x_n$, given the measured values of $y_n$, the following
assumptions will be adopted:

• $E[\eta_n\eta_n^T] = Q_n$, $E[\eta_n\eta_m^T] = O$, $n \ne m$,

• $E[v_n v_n^T] = R_n$, $E[v_n v_m^T] = O$, $n \ne m$,

• $E[\eta_n v_m^T] = O$, $\forall n, m$,

• $E[\eta_n] = E[v_n] = 0$, $\forall n$,

where $O$ denotes a matrix with zero elements. That is, $\eta_n$, $v_n$ are uncorrelated; moreover,
noise vectors at different time instants are also considered uncorrelated. Versions where some of these
conditions are relaxed are also available. The respective covariance matrices, $Q_n$ and $R_n$, are
assumed to be known.

The development of the time-varying estimation task evolves around two types of estimators for
the state variables:

• The first one is denoted as

$$\hat x_{n|n-1},$$

and it is based on all information that has been received up to and including time instant $n-1$; in
other words, the obtained observations of $y_0, y_1, \ldots, y_{n-1}$. This is known as the a priori
or prior estimator.

• The second estimator at time $n$ is known as the posterior one, it is denoted as

$$\hat x_{n|n},$$

and it is computed by updating $\hat x_{n|n-1}$ after $y_n$ has been observed.

For the development of the algorithm, assume that at time $n-1$ all required information is
available; that is, the value of the posterior estimator as well as the respective error covariance
matrix,

$$\hat x_{n-1|n-1}, \quad P_{n-1|n-1} := E[e_{n-1|n-1}e_{n-1|n-1}^T],$$

where

$$e_{n-1|n-1} := x_{n-1} - \hat x_{n-1|n-1}.$$

Step 1: Using $\hat x_{n-1|n-1}$, predict $\hat x_{n|n-1}$ using the state equation; that is,

$$\hat x_{n|n-1} = F_n \hat x_{n-1|n-1}. \tag{4.114}$$

In other words, ignore the contribution from the noise. This is natural, because prediction cannot
involve the unobserved variables.

Step 2: Obtain the respective error covariance matrix,

$$P_{n|n-1} = E[(x_n - \hat x_{n|n-1})(x_n - \hat x_{n|n-1})^T]. \tag{4.115}$$

However,

$$e_{n|n-1} := x_n - \hat x_{n|n-1} = F_n x_{n-1} + \eta_n - F_n\hat x_{n-1|n-1}
= F_n e_{n-1|n-1} + \eta_n. \tag{4.116}$$

Combining (4.115) and (4.116), it is straightforward to see that

$$P_{n|n-1} = F_n P_{n-1|n-1} F_n^T + Q_n. \tag{4.117}$$

Step 3: Update $\hat x_{n|n-1}$. To this end, adopt the following recursion:

$$\hat x_{n|n} = \hat x_{n|n-1} + K_n e_n, \tag{4.118}$$

where

$$e_n := y_n - H_n\hat x_{n|n-1}. \tag{4.119}$$

This time update recursion, once the observations for $y_n$ have been received, has a form that we will
meet over and over again in this book. The “new” (posterior) estimate is equal to the “old” (prior) one,
which is based on the past history, plus a correction term; the latter is proportional to the error
$e_n$ between the newly arrived observation vector and its prediction based on the “old” estimate.
Matrix $K_n$, known as the Kalman gain, controls the amount of correction and its value is computed so
as to minimize the MSE; in other words,

$$J(K_n) := E[e_{n|n}^T e_{n|n}] = \mathrm{trace}\{P_{n|n}\}, \tag{4.120}$$

where

$$P_{n|n} = E[e_{n|n}e_{n|n}^T] \tag{4.121}$$

and

$$e_{n|n} := x_n - \hat x_{n|n}.$$

It can be shown that the optimal Kalman gain is equal to (Problem 4.22)

$$K_n = P_{n|n-1}H_n^T S_n^{-1}, \tag{4.122}$$

where

$$S_n = R_n + H_n P_{n|n-1} H_n^T. \tag{4.123}$$

Step 4: The final recursion that is now needed in order to complete the scheme is that for the update
of $P_{n|n}$. Combining the definitions in (4.119) and (4.121) with (4.118), we obtain the following
result (Problem 4.23):

$$P_{n|n} = P_{n|n-1} - K_n H_n P_{n|n-1}. \tag{4.124}$$

The algorithm has now been derived. All that is now needed is to select the initial conditions, which
are chosen such that

$$\hat x_{1|0} = E[x_1], \tag{4.125}$$

$$P_{1|0} = \Pi_0, \tag{4.126}$$

for some initial guess $\Pi_0$. The Kalman algorithm is summarized in Algorithm 4.2.

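Algorithm 4.2 (Kalman filtering).

• Input: $F_n$, $H_n$, $Q_n$, $R_n$, $y_n$, $n = 1, 2, \ldots$

• Initialization:

  - $\hat x_{1|0} = E[x_1]$

  - $P_{1|0} = \Pi_0$

• For $n = 1, 2, \ldots$, Do

  - $S_n = R_n + H_n P_{n|n-1} H_n^T$

  - $K_n = P_{n|n-1} H_n^T S_n^{-1}$

  - $\hat x_{n|n} = \hat x_{n|n-1} + K_n(y_n - H_n\hat x_{n|n-1})$

  - $P_{n|n} = P_{n|n-1} - K_n H_n P_{n|n-1}$

  - $\hat x_{n+1|n} = F_{n+1}\hat x_{n|n}$

  - $P_{n+1|n} = F_{n+1} P_{n|n} F_{n+1}^T + Q_{n+1}$

• End For

For complex-valued variables, transposition is replaced by the Hermitian operation.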
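
The recursions of Algorithm 4.2 can be sketched in a few lines for the simplest possible setting.
The code below assumes scalar state and observation and time-invariant $F$, $H$, $Q$, $R$; the
function name and the numeric values are hypothetical. It is meant as an illustration of the update
order, not as a general implementation.

```python
# Sketch: Algorithm 4.2 for scalar, time-invariant F, H, Q, R.

def kalman_scalar(ys, F, H, Q, R, x_prior, P_prior):
    """Return the list of posterior pairs (x_{n|n}, P_{n|n}) for the observations ys."""
    posteriors = []
    for y in ys:
        S = R + H * P_prior * H                   # (4.123)
        K = P_prior * H / S                       # (4.122), Kalman gain
        x_post = x_prior + K * (y - H * x_prior)  # (4.118)-(4.119)
        P_post = P_prior - K * H * P_prior        # (4.124)
        posteriors.append((x_post, P_post))
        x_prior = F * x_post                      # (4.114), prediction for the next step
        P_prior = F * P_post * F + Q              # (4.117)
    return posteriors

# Constant state (F = 1, Q = 0) observed in unit-variance noise, three observations.
results = kalman_scalar([1.0, 1.0, 1.0], F=1.0, H=1.0, Q=0.0, R=1.0,
                        x_prior=0.0, P_prior=1.0)
```

In this constant-state case the posterior variance after $n$ observations is $1/(1+n)$, so the
filter behaves like a recursive average that is shrunk toward the prior mean.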

Remarks 4.7.

• Besides the previously derived basic scheme, there are a number of variants. Although, in theory,
they are all equivalent, their practical implementation may lead to different performance. Observe
that $P_{n|n}$ is computed as the difference of two positive definite matrices; this may lead to a
$P_{n|n}$ that is not positive definite, due to numerical errors. This can cause the algorithm to
diverge. A popular alternative is the so-called information filtering scheme, which propagates the
inverse state-error covariance matrices, $P_{n|n}^{-1}$ and $P_{n|n-1}^{-1}$ [20]. In contrast, the
scheme in Algorithm 4.2 is known as the covariance Kalman algorithm (Problem 4.24).

To cope with the numerical stability issues, a family of algorithms propagates the factors of
$P_{n|n}$ (or $P_{n|n}^{-1}$), resulting from the respective Cholesky factorization [5,40].

• There are different approaches to arrive at the Kalman filtering recursions. An alternative derivation
is based on the orthogonality principle applied to the so-called innovations process associated with
the observation sequence, so that

$$\epsilon(n) = y_n - \hat y_{n|1:n-1},$$

where $\hat y_{n|1:n-1}$ is the prediction based on the past observations history [17]. In Chapter 17,
we are going to rederive the Kalman recursions looking at it as a Bayesian network.

• Kalman filtering is a generalization of the optimal MSE linear filtering. It can be shown that when
the involved processes are stationary, the Kalman filter converges in its steady state to our familiar
normal equations [31].

• Extended Kalman filters. In (4.112) and (4.113), both the state and the output equations have a linear
dependence on the state vector $x_n$. Kalman filtering, in a more general formulation, can be cast as

$$x_n = f_n(x_{n-1}) + \eta_n,$$
$$y_n = h_n(x_n) + v_n,$$

where $f_n$ and $h_n$ are nonlinear vector functions. In the extended Kalman filtering (EKF), the idea
is to linearize the functions $h_n$ and $f_n$, at each time instant, via their Taylor series expansions,
and keep the linear term only, so that

$$F_n = \left.\frac{\partial f_n(x_n)}{\partial x_n}\right|_{x_n = \hat x_{n-1|n-1}}, \quad
H_n = \left.\frac{\partial h_n(x_n)}{\partial x_n}\right|_{x_n = \hat x_{n|n-1}},$$

and then proceed by using the updates derived for the linear case.

By its very definition, the EKF is suboptimal and often in practice one may face divergence of
the algorithm; in general, it must be stated that its practical implementation needs to be carried out
with care. Having said that, it must be pointed out that it is heavily used in a number of practical
systems.

Unscented Kalman filters represent an alternative way to cope with the nonlinearity, and the
main idea springs forth from probabilistic arguments. A set of points are deterministically selected
from a Gaussian approximation of $p(x_n|y_1, \ldots, y_n)$; these points are propagated through the
nonlinearities, and estimates of the mean values and covariances are obtained [15]. Particle filtering,
to be discussed in Chapter 17, is another powerful and popular approach to deal with nonlinear
state-space models via probabilistic arguments.

More recently, extensions of Kalman filtering in reproducing kernel Hilbert spaces offer an
alternative approach to deal with nonlinearities [52].

A number of Kalman filtering versions for distributed learning (Chapter 5) have appeared in, e.g.,
[9,23,33,43]. In the latter of the references, subspace learning methods are utilized in the prediction
stage associated with the state variables.

• The literature on Kalman filtering is huge, especially when applications are concerned. The
interested reader may consult more specialized texts, for example [4,8,17] and the references therein.
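
The linearization idea behind the EKF, described in the remarks above, can be sketched for a scalar
state. The code assumes that the user supplies the nonlinear functions together with their
derivatives; all names (`ekf_step`, `f`, `h`, `df`, `dh`) and the example model are hypothetical.
The state derivative is evaluated at the previous posterior estimate and the output derivative at
the new prior estimate, which is the usual convention.

```python
import math

def ekf_step(x_post, P_post, y, f, h, df, dh, Q, R):
    """One scalar EKF recursion: predict through f, then correct with observation y."""
    x_prior = f(x_post)
    F = df(x_post)                  # linearize f around the previous posterior estimate
    P_prior = F * P_post * F + Q
    H = dh(x_prior)                 # linearize h around the prior estimate
    S = R + H * P_prior * H
    K = P_prior * H / S
    x_new = x_prior + K * (y - h(x_prior))
    P_new = P_prior - K * H * P_prior
    return x_new, P_new

# Sanity check with a linear model, where the EKF must coincide with the Kalman filter.
x_lin, P_lin = ekf_step(1.0, 0.5, 2.0,
                        f=lambda x: 0.9 * x, h=lambda x: 2.0 * x,
                        df=lambda x: 0.9, dh=lambda x: 2.0, Q=0.1, R=0.2)

# A mildly nonlinear example model (purely illustrative).
x_nl, P_nl = ekf_step(1.0, 0.5, 1.1,
                      f=lambda x: x + 0.1 * math.sin(x), h=lambda x: x * x,
                      df=lambda x: 1.0 + 0.1 * math.cos(x), dh=lambda x: 2.0 * x,
                      Q=0.1, R=0.2)
```

Because only the first-order Taylor term is kept, the quality of the result depends on how well
the functions are approximated by a line around the current estimate.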

Example 4.4 (Autoregressive process estimation). Let us consider an AR process (Chapter 2) of
order $l$, represented as

$$x_n = -\sum_{i=1}^{l} a_i x_{n-i} + \eta_n, \tag{4.127}$$

where $\eta_n$ is a white noise sequence of variance $\sigma_\eta^2$. Our task is to obtain an estimate
$\hat x_n$ of $x_n$, having observed a noisy version of it, $y_n$. The corresponding random variables
are related as

$$y_n = x_n + v_n. \tag{4.128}$$

To this end, the Kalman filtering formulation will be used. Note that the MSE linear estimation,
presented in Section 4.9, cannot be used here. As we have already discussed in Chapter 2, an AR
process is asymptotically stationary; for finite-time samples, the initial conditions at time $n = 0$
are “remembered” by the process and the respective second-order statistics are time dependent, hence it
is a nonstationary process. However, Kalman filtering is especially suited for such cases.

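Let us rewrite (4.127) and (4.128) as

$$\begin{bmatrix} x_n \\ x_{n-1} \\ x_{n-2} \\ \vdots \\ x_{n-l+1} \end{bmatrix}
= \begin{bmatrix}
-a_1 & -a_2 & \cdots & -a_{l-1} & -a_l \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} x_{n-1} \\ x_{n-2} \\ x_{n-3} \\ \vdots \\ x_{n-l} \end{bmatrix}
+ \begin{bmatrix} \eta_n \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{4.129}$$

$$y_n = [1 \; 0 \; \cdots \; 0]
\begin{bmatrix} x_n \\ \vdots \\ x_{n-l+1} \end{bmatrix} + v_n, \tag{4.130}$$

or, with the state vector $\mathbf{x}_n := [x_n, x_{n-1}, \ldots, x_{n-l+1}]^T$,

$$\mathbf{x}_n = F\mathbf{x}_{n-1} + \boldsymbol{\eta},$$
$$y_n = H\mathbf{x}_n + v_n,$$

where the definitions of $F_n := F$ and $H_n := H$ are obvious and

$$Q_n = \begin{bmatrix}
\sigma_\eta^2 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}, \quad R_n = \sigma_v^2 \; \text{(scalar)}.$$

FIGURE 4.21

(A) A realization of the observation sequence, $y_n$, which is used by the Kalman filter to obtain the
predictions of the state variable. (B) The AR process (state variable) in red together with the sequence
predicted by the Kalman filter (gray), for Example 4.4. The Kalman filter has removed the effect of the
noise $v_n$.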


Fig. 4.21A shows the values of a specific realization $y_n$, and Fig. 4.21B shows the corresponding
realization of the AR(2) (red) together with the predicted Kalman filter sequence $\hat x_n$. Observe
that the match is very good. For the generation of the AR process we used $l = 2$, $a_1 = 0.95$,
$a_2 = 0.9$, $\sigma_\eta^2 = 0.5$. For the Kalman filter output noise, $\sigma_v^2 = 1$.
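
A rough sketch of the experiment in Example 4.4 is given below for the AR(2) case, using the
parameter values quoted in the text. The initial state estimate, the initial covariance (identity),
the sample size, the random seed, and all function names are assumptions made for this
illustration; the sketch does not attempt to reproduce Fig. 4.21. Since $H = [1 \; 0]$, the
innovation covariance is a scalar and the matrix operations can be written out by hand.

```python
import random

def simulate_ar2(a1, a2, var_eta, var_v, n, rng):
    """Generate x_n = -a1 x_{n-1} - a2 x_{n-2} + eta_n and y_n = x_n + v_n."""
    x1 = x2 = 0.0
    xs, ys = [], []
    for _ in range(n):
        x = -a1 * x1 - a2 * x2 + rng.gauss(0.0, var_eta ** 0.5)
        xs.append(x)
        ys.append(x + rng.gauss(0.0, var_v ** 0.5))
        x2, x1 = x1, x
    return xs, ys

def kalman_ar2(ys, a1, a2, var_eta, var_v):
    """Kalman filter with state [x_n, x_{n-1}], F = [[-a1, -a2], [1, 0]], H = [1, 0]."""
    x = [0.0, 0.0]                      # prior state estimate (assumed)
    P = [[1.0, 0.0], [0.0, 1.0]]        # prior error covariance (assumed initial guess)
    filtered = []
    for y in ys:
        # Correction step; H P is simply the first row of P.
        S = var_v + P[0][0]
        K = [P[0][0] / S, P[1][0] / S]
        err = y - x[0]
        x = [x[0] + K[0] * err, x[1] + K[1] * err]
        P = [[P[i][j] - K[i] * P[0][j] for j in range(2)] for i in range(2)]
        filtered.append(x[0])
        # Prediction step: x <- F x, P <- F P F^T + Q with Q = diag(var_eta, 0).
        x = [-a1 * x[0] - a2 * x[1], x[0]]
        FP = [[-a1 * P[0][j] - a2 * P[1][j] for j in range(2)], [P[0][0], P[0][1]]]
        P = [[-a1 * FP[0][0] - a2 * FP[0][1] + var_eta, FP[0][0]],
             [-a1 * FP[1][0] - a2 * FP[1][1], FP[1][0]]]
    return filtered

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

rng = random.Random(0)
xs, ys = simulate_ar2(0.95, 0.9, 0.5, 1.0, 500, rng)
filtered = kalman_ar2(ys, 0.95, 0.9, 0.5, 1.0)
mse_raw, mse_filtered = mse(ys, xs), mse(filtered, xs)
```

Comparing `mse_filtered` with `mse_raw` gives a quick indication of how much of the observation
noise the filter removes.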


PROBLEMS 

4.1 Show that the set of equations

$$\Sigma\theta = p$$

has a unique solution if $\Sigma > 0$ and infinitely many if $\Sigma$ is singular.

4.2 Show that the set of equations

$$\Sigma\theta = p$$

always has a solution.

4.3 Show that the shape of the isovalue contours of the graph of $J(\theta)$ (MSE),

$$J(\theta) = J(\theta_*) + (\theta - \theta_*)^T\Sigma(\theta - \theta_*),$$

are ellipses whose axes depend on the eigenstructure of $\Sigma$.

Hint: Assume that $\Sigma$ has discrete eigenvalues.

4.4 Prove that if the true relation between the input $x$ and the true output $y$ is linear, meaning

$$y = \theta_o^T x + v, \quad \theta_o \in \mathbb{R}^l,$$

where $v$ is independent of $x$, then the optimal MSE estimate $\theta_*$ satisfies

$$\theta_* = \theta_o.$$

4.5 Show that if

$$y = \theta_o^T x + v, \quad \theta_o \in \mathbb{R}^k,$$

where $v$ is independent of $x$, then the optimal MSE $\theta_* \in \mathbb{R}^l$, $l < k$, is equal
to the top $l$ components of $\theta_o$ if the components of $x$ are uncorrelated.

4.6 Derive the normal equations by minimizing the cost in (4.15).

Hint: Express the cost in terms of the real part $\theta_r$ of $\theta$ and its imaginary part
$\theta_i$, and optimize with respect to $\theta_r$, $\theta_i$.

4.7 Consider the multichannel filtering task

$$\hat y = \begin{bmatrix} \hat y_r \\ \hat y_i \end{bmatrix}
= \Theta \begin{bmatrix} x_r \\ x_i \end{bmatrix}.$$

Estimate $\Theta$ so as to minimize the error norm

$$E[\|y - \hat y\|^2].$$

4.8 Show that (4.34) is the same as (4.25).

4.9 Show that the MSE achieved by a linear complex-valued estimator is always larger than that 
obtained by a widely linear one. Equality is achieved only under the circularity conditions. 

4.10 Show that under the second-order circularity assumption, the conditions in (4.39) hold true. 

4.11 Show that if

$$f : \mathbb{C} \to \mathbb{R},$$

then the Cauchy-Riemann conditions are violated.

4.12 Derive the optimality condition in (4.45). 

4.13 Show Eqs. (4.50) and (4.51). 

4.14 Derive the normal equations for Example 4.2.

4.15 The input to the channel is a white noise sequence $s_n$ of variance $\sigma_s^2$. The output of
the channel is the AR process

$$y_n = a_1 y_{n-1} + s_n. \tag{4.131}$$

The channel also adds white noise $\eta_n$ of variance $\sigma_\eta^2$. Design an optimal equalizer of
order two, which at its output recovers an approximation of $s_{n-l}$. Sometimes, this equalization
task is also known as whitening, because in this case the action of the equalizer is to “whiten” the
AR process.

4.16 Show that the forward and backward MSE optimal predictors are conjugate reverse of each other.

4.17 Show that the MSE prediction errors ($\alpha_m^f = \alpha_m^b$) are updated according to the
recursion

$$\alpha_m^b = \alpha_{m-1}^b(1 - |\kappa_{m-1}|^2).$$

4.18 Derive the BLUE for the Gauss-Markov theorem.

4.19 Show that the MSE (which in this case coincides with the variance of the estimator) of any linear
unbiased estimator is higher than that associated with the BLUE.

4.20 Show that if $\Sigma_\eta$ is positive definite, then $X^T\Sigma_\eta^{-1}X$ is also positive
definite if $X$ is full rank.

4.21 Derive an MSE optimal linearly constrained widely linear beamformer.

4.22 Prove that the Kalman gain that minimizes the error covariance matrix

$$P_{n|n} = E[(x_n - \hat x_{n|n})(x_n - \hat x_{n|n})^T]$$

is given by

$$K_n = P_{n|n-1}H_n^T(R_n + H_n P_{n|n-1}H_n^T)^{-1}.$$

Hint: Use the following formulas:

$$\frac{\partial\,\mathrm{trace}\{AB\}}{\partial A} = B^T \quad (AB \text{ a square matrix}),$$

$$\frac{\partial\,\mathrm{trace}\{ACA^T\}}{\partial A} = 2AC, \quad (C = C^T).$$

4.23 Show that in Kalman filtering, the prior and posterior error covariance matrices are related as

$$P_{n|n} = P_{n|n-1} - K_n H_n P_{n|n-1}.$$

4.24 Derive the Kalman algorithm in terms of the inverse state-error covariance matrices,
$P_{n|n}^{-1}$. In statistics, the inverse error covariance matrix is related to Fisher’s information
matrix; hence, the name of the scheme.


MATLAB® EXERCISES 

4.25 Consider the image deblurring task described in Section 4.6.

• Download the “boat” image from Waterloo’s Image repository.⁸ Alternatively, you may use
any grayscale image of your choice. You can load the image into MATLAB®’s memory using
the “imread” function (also, you may want to apply the function “im2double” to get an array
consisting of doubles).

• Create a blurring point spread function (PSF) using MATLAB®’s command “fspecial.” For
example, you can write

    PSF = fspecial('motion', 20, 45);

The blurring effect is produced using the “imfilter” function

    J = imfilter(I, PSF, 'conv', 'circ');

where I is the original image.

• Add some white Gaussian noise to the image using MATLAB®’s function “imnoise,” as
follows:

    J = imnoise(J, 'gaussian', noise_mean, noise_var);

Use a small value of noise variance, such as $10^{-6}$.

• To perform the deblurring, you need to employ the “deconvwnr” function. For example, if
J is the array that contains the blurred image (with the noise) and PSF is the point spread
function that produced the blurring, then the command

    K = deconvwnr(J, PSF, C);

returns the deblurred image K, provided that the choice of C is reasonable. As a first attempt,
select $C = 10^{-4}$. Use various values for C of your choice. Comment on the results.

4.26 Consider the noise cancelation task described in Example 4.1. Write the necessary code to solve
the problem using MATLAB® according to the following steps:

(a) Create 5000 data samples of the signal $s_n = \cos(\omega_0 n)$, for $\omega_0 = 2 \times 10^{-3}\pi$.

(b) Create 5000 data samples of the AR process $v_1(n) = a_1 \cdot v_1(n-1) + \eta_n$ (initializing at
zero), where $\eta_n$ represents zero mean Gaussian noise with variance $\sigma_\eta^2 = 0.0025$ and $a_1 = 0.8$.

(c) Add the two sequences (i.e., $d_n = s_n + v_1(n)$) and plot the result. This represents the
contaminated signal.

(d) Create 5000 data samples of the AR process $v_2(n) = a_2 v_2(n-1) + \eta_n$ (initializing at zero),
where $\eta_n$ represents the same noise sequence and $a_2 = 0.75$.

(e) Solve for the optimum (in the MSE sense) $\boldsymbol{w} = [w_0, w_1]^T$. Create the sequence of the
restored signal $\hat{s}_n = d_n - w_0 v_2(n) - w_1 v_2(n-1)$ and plot the result.

(f) Repeat steps (b)-(e) using $a_2 = 0.9, 0.8, 0.7, 0.6, 0.5, 0.3$. Comment on the results.

(g) Repeat steps (b)-(e) using $\sigma_\eta^2 = 0.01, 0.05, 0.1, 0.2, 0.5$, for $a_2 = 0.9, 0.8, 0.7, 0.6, 0.5, 0.3$.
Comment on the results.

⁸ http://links.uwaterloo.ca/.

4.27 Consider the channel equalization task described in Example 4.2. Write the necessary code to
solve the problem using MATLAB® according to the following steps:

(a) Create a signal $s_n$ consisting of 50 equiprobable $\pm 1$ samples. Plot the result using
MATLAB®’s function “stem.”

(b) Create the sequence $u_n = 0.5 s_n + s_{n-1} + \eta_n$, where $\eta_n$ denotes zero mean Gaussian noise
with $\sigma_\eta^2 = 0.01$. Plot the result with “stem.”

(c) Find the optimal $\boldsymbol{w}_* = [w_0, w_1, w_2]^T$, solving the normal equations.

(d) Construct the sequence of the reconstructed signal $\hat{s}_n = \mathrm{sgn}(w_0 u_n + w_1 u_{n-1} + w_2 u_{n-2})$.
Plot the result with “stem” using red color for the correctly reconstructed values (i.e., those
that satisfy $s_n = \hat{s}_n$) and black color for errors.

(e) Repeat steps (b)-(d) using different noise levels for $\sigma_\eta^2$. Comment on the results.

4.28 Consider the autoregressive process estimation task described in Example 4.4. Write the necessary
code to solve the problem using MATLAB® according to the following steps:

(a) Create 500 samples of the AR sequence $x_n = -a_1 x_{n-1} - a_2 x_{n-2} + \eta_n$ (initializing at
zeros), where $a_1 = 0.2$, $a_2 = 0.1$, and $\eta_n$ denotes zero mean Gaussian noise with $\sigma_\eta^2 = 0.5$.

(b) Create the sequence $y_n = x_n + v_n$, where $v_n$ denotes zero mean Gaussian noise with $\sigma_v^2 = 1$.

(c) Implement the Kalman filtering algorithm as described in Algorithm 4.2, using $y_n$ as inputs
and the matrices $F$, $H$, $Q$, $R$ as described in Example 4.4. To initialize the algorithm, you
can use $\hat{\boldsymbol{x}}_{1|0} = [0, 0]^T$ and $P_{1|0} = 0.1 \cdot I_2$. Plot the predicted values $\hat{x}_n$ (i.e., $\hat{x}_{n|n}$) versus the
original sequence $x_n$. Play with the values of the different parameters and comment on the
obtained results.
5.1 INTRODUCTION

In Chapter 4, we introduced the notion of mean-square error (MSE) optimal linear estimation and 
stated the normal equations for computing the parameters/coefficients of the optimal estimator/filter.
A prerequisite for the normal equations is the knowledge of the second-order statistics of the involved 
processes/variables, so that the covariance matrix of the input and the input-output cross-correlation 
vector can be obtained. However, most often in practice, all the designer has at her/his disposal is a set 
of training samples; thus, the covariance matrix and the cross-correlation vector have to be estimated 
somehow. More importantly, in a number of practical applications, the underlying statistics may be 
time-varying. We discussed this scenario while introducing Kalman filtering. The path taken there 
was to adopt a state-space representation and assume that the time dynamics of the model were known. 
However, although Kalman filtering is an elegant tool, it does not scale well in high-dimensional spaces 
due to the involved matrix operations and inversions. 

The focus of this chapter is to introduce online learning techniques for estimating the unknown
parameter vector. These are time-iterative schemes, which update the available estimate every time
a measurement set (input-output pair of observations) is acquired. Thus, in contrast to the so-called
batch processing methods, which process the whole block of data as a single entity, online algorithms
operate on a single data point at a time; therefore, such schemes do not require the training data set to be 
known and stored in advance. Online algorithmic schemes learn the underlying statistics from the data 
in a time-iterative fashion. Hence, one does not have to provide further statistical information. Another 
characteristic of the algorithmic family, to be developed and studied in this chapter, is its computational 
simplicity. The required complexity for updating the estimate of the unknown parameter vector is linear 
with respect to the number of the unknown parameters. This is one of the major reasons that have made 
such schemes very popular in a number of practical applications; besides complexity, we will discuss 
other reasons that have contributed to their popularity. A discussion concerning batch versus online 
algorithms is provided in Section 8.12. 

The fact that such learning algorithms work in a time-iterative mode gives them the agility to learn 
and track slow time variations of the statistics of the involved processes/variables; this is the reason 
these algorithms are also called time-adaptive or simply adaptive, because they can adapt to the needs
of a changing environment. Online/time-adaptive algorithms have been used extensively since the early
1960s in a wide range of applications, including signal processing, control, and communications. More
recently, the philosophy behind such schemes has been gaining in popularity in the context of applications
where data reside in large datasets, with a massive number of samples; for such tasks, storing
all the data points in the memory may not be possible, and they have to be considered one at a time. 
Moreover, the complexity of batch processing techniques can amount to prohibitive levels, for today’s 
technology. The current trend is to refer to such applications as big data problems. 

In this chapter, we focus on a very popular class of online/adaptive algorithms that springs from the 
classical gradient descent method for optimization. Although our emphasis will be on the squared error 
loss function, the same rationale can also be adopted for other (differentiable) loss functions. The case 
of nondifferentiable loss functions will be treated in Chapter 8. A number of variants of the stochastic
gradient rationale, in the context of deep neural networks, are given in Chapter 18. The online processing
rationale will be a recurrent theme in this book.


5.2 THE STEEPEST DESCENT METHOD 

Our starting point is the method of gradient descent, one of the most widely used methods for iterative
minimization of a differentiable cost function, $J(\boldsymbol{\theta})$, $\boldsymbol{\theta} \in \mathbb{R}^l$. As does any other iterative technique, the
method starts from an initial estimate, $\boldsymbol{\theta}^{(0)}$, and generates a sequence, $\boldsymbol{\theta}^{(i)}$, $i = 1, 2, \ldots$, such that

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \mu_i \Delta\boldsymbol{\theta}^{(i)}, \quad i > 0, \tag{5.1}$$

where $\mu_i > 0$. All the schemes for the iterative minimization of a cost function, which we will deal
with in this book, have the general form of (5.1). Their differences are in the way that $\mu_i$ and $\Delta\boldsymbol{\theta}^{(i)}$
are chosen; the latter vector is known as the update direction or the search direction. The sequence $\mu_i$
is known as the step size or the step length at the $i$th iteration; note that the values of $\mu_i$ may either
be constant or change at each iteration. In the gradient descent method, the choice of $\Delta\boldsymbol{\theta}^{(i)}$ is made to
guarantee that

$$J(\boldsymbol{\theta}^{(i)}) < J(\boldsymbol{\theta}^{(i-1)}),$$

except at a minimizer, $\boldsymbol{\theta}_*$.

Assume that at the $(i-1)$th iteration step the value $\boldsymbol{\theta}^{(i-1)}$ has been obtained. Then, for sufficiently
small $\mu_i$ and mobilizing a first-order Taylor expansion around $\boldsymbol{\theta}^{(i-1)}$, we can write

$$J(\boldsymbol{\theta}^{(i)}) = J(\boldsymbol{\theta}^{(i-1)} + \mu_i \Delta\boldsymbol{\theta}^{(i)}) \approx J(\boldsymbol{\theta}^{(i-1)}) + \mu_i \nabla J(\boldsymbol{\theta}^{(i-1)})^T \Delta\boldsymbol{\theta}^{(i)}.$$

Looking carefully at the above approximation, what we basically do is to linearize locally the cost
function with respect to $\Delta\boldsymbol{\theta}^{(i)}$. Selecting the search direction so that

$$\nabla J(\boldsymbol{\theta}^{(i-1)})^T \Delta\boldsymbol{\theta}^{(i)} < 0 \tag{5.2}$$

guarantees that $J(\boldsymbol{\theta}^{(i-1)} + \mu_i \Delta\boldsymbol{\theta}^{(i)}) < J(\boldsymbol{\theta}^{(i-1)})$. For such a choice, $\Delta\boldsymbol{\theta}^{(i)}$ and $\nabla J(\boldsymbol{\theta}^{(i-1)})$ must
form an obtuse angle. Fig. 5.1 shows the graph of a cost function in the two-dimensional case, $\boldsymbol{\theta} \in \mathbb{R}^2$,
and Fig. 5.2 shows the respective isovalue contours in the two-dimensional plane. Note that, in general,
the contours can have any shape and are not necessarily ellipses; it all depends on the functional form
of $J(\boldsymbol{\theta})$. However, because $J(\boldsymbol{\theta})$ has been assumed differentiable, the contours must be smooth and
accept at any point a (unique) tangent plane, as this is defined by the respective gradient. Furthermore,
recall from basic calculus that the gradient vector, $\nabla J(\boldsymbol{\theta})$, is perpendicular to the plane (line) tangent to
the corresponding isovalue contour at the point $\boldsymbol{\theta}$ (Problem 5.1). The geometry is illustrated in Fig. 5.3;
to facilitate the drawing and unclutter notation, we have removed the iteration index $i$. Note that
selecting a search direction that forms an obtuse angle with the gradient places $\boldsymbol{\theta}^{(i-1)} + \mu_i \Delta\boldsymbol{\theta}^{(i)}$
at a point on a contour which corresponds to a lower value of $J(\boldsymbol{\theta})$. Two issues are now raised: (a) to
choose the best search direction along which to move and (b) to compute how far along this direction
one can go. Even without much mathematics, it is obvious from Fig. 5.3 that if $\mu_i \|\Delta\boldsymbol{\theta}^{(i)}\|$ is too large,
then the new point can be placed on a contour corresponding to a larger value than that of the current
contour; after all, the first-order Taylor expansion holds approximately true for small deviations from
$\boldsymbol{\theta}^{(i-1)}$.

FIGURE 5.1

A cost function in the two-dimensional parameter space.

FIGURE 5.2

The corresponding isovalue curves of the cost function in Fig. 5.1, in the two-dimensional plane. All the points $\boldsymbol{\theta}$
lying on the same (isovalue) ellipse score the same value for the cost $J(\boldsymbol{\theta})$. Note that as we move away from the
optimal value, $\boldsymbol{\theta}_*$, the values of $c$ increase.

To address the first of the two issues, let us assume $\mu_i = 1$ and search for all vectors, $\boldsymbol{z}$, with unit
Euclidean norm and having their origin at $\boldsymbol{\theta}^{(i-1)}$. Then it does not take long to see that for all possible
directions, the one that gives the most negative value of the inner product, $\nabla J(\boldsymbol{\theta}^{(i-1)})^T \boldsymbol{z}$, is that of the
negative gradient

$$\boldsymbol{z} = -\frac{\nabla J(\boldsymbol{\theta}^{(i-1)})}{\|\nabla J(\boldsymbol{\theta}^{(i-1)})\|}.$$
FIGURE 5.3

The gradient vector at a point $\boldsymbol{\theta}$ is perpendicular to the tangent plane (dotted line) at the isovalue curve crossing $\boldsymbol{\theta}$.
The descent direction forms an obtuse angle, $\phi$, with the gradient vector.

FIGURE 5.4

From all the descent directions of unit Euclidean norm (dotted circle centered at $\boldsymbol{\theta}^{(i-1)}$, which for notational
simplicity is shown as $\boldsymbol{\theta}$), the negative gradient one leads to the maximum decrease of the cost function.


This is illustrated in Fig. 5.4. Center the unit Euclidean norm ball at $\boldsymbol{\theta}^{(i-1)}$. Then from all the unit
norm vectors having their origin at $\boldsymbol{\theta}^{(i-1)}$, choose the one pointing to the negative gradient direction.

Thus, for all unit Euclidean norm vectors, the steepest descent direction coincides with the (negative)
gradient descent direction, and the corresponding update recursion becomes

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu_i \nabla J(\boldsymbol{\theta}^{(i-1)}): \quad \text{gradient descent scheme}. \tag{5.3}$$

Note that we still have to address the second point, concerning the choice of $\mu_i$. The choice must be
made in such a way as to guarantee convergence of the minimizing sequence. We will come to this issue
soon.
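
As a minimal sketch of the recursion in (5.3), the following Python snippet iterates the update with a constant step size. The quadratic cost, its minimizer $[1, -3]^T$, the starting point, and the step size are illustrative assumptions and do not come from the text.

```python
def gradient_descent(grad, theta0, mu, num_iters):
    """Iterate theta <- theta - mu * grad(theta), i.e., (5.3) with a constant step size."""
    theta = list(theta0)
    for _ in range(num_iters):
        g = grad(theta)
        theta = [t - mu * gi for t, gi in zip(theta, g)]
    return theta


# Illustrative cost (an assumption): J(theta) = (theta_1 - 1)^2 + 2 * (theta_2 + 3)^2
def grad_J(theta):
    return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 3.0)]


theta_hat = gradient_descent(grad_J, theta0=[0.0, 0.0], mu=0.1, num_iters=200)
print(theta_hat)  # close to the minimizer [1, -3]
```

The step size is deliberately small here; how large it may become before convergence is lost is the subject of the analysis that follows.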
FIGURE 5.5

Once the algorithm is at $\theta_1$, the gradient descent will move the point to the left, towards the minimum. The opposite
is true for point $\theta_2$.


Iteration (5.3) is illustrated in Fig. 5.5 for the one-dimensional case. If at the current iteration the
algorithm has “landed” at $\theta_1$, then the derivative of $J(\theta)$ at this point is positive (the tangent of an
acute angle, $\phi_1$), and this will force the update to move to the left towards the minimum. The scenario
is different if the current estimate was $\theta_2$. The derivative is negative (the tangent of an obtuse angle,
$\phi_2$) and this will push the update to the right, toward, again, the minimum. Note, however, that it is
important how far to the left or to the right one has to move. A large move from, say, $\theta_1$ to the left may
land the update on the other side of the optimal value. In such a case, the algorithm may oscillate around
the minimum and never converge. A major effort in this chapter will be devoted to providing theoretical
frameworks for establishing bounds for the values of the step size that guarantee convergence.
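
The effect of the step size in the one-dimensional case can be sketched with a few lines of Python. The cost $J(\theta) = \theta^2$ and the three step sizes are assumptions chosen only so that the three regimes (convergence, perpetual oscillation, divergence) are easy to see.

```python
def descend_1d(theta0, mu, num_iters):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    path = [theta0]
    for _ in range(num_iters):
        path.append(path[-1] - mu * 2.0 * path[-1])
    return path


small = descend_1d(1.0, mu=0.25, num_iters=6)  # halves the error at every step
edge = descend_1d(1.0, mu=1.0, num_iters=6)    # lands on the other side, same distance
large = descend_1d(1.0, mu=1.25, num_iters=6)  # overshoots by more at every step
```

For this cost each update multiplies the current estimate by $1 - 2\mu$, so the three cases correspond to factors $0.5$, $-1$, and $-1.5$.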

The gradient descent method exhibits approximately linear convergence; that is, the error between
$\boldsymbol{\theta}^{(i)}$ and the true minimum converges to zero asymptotically in the form of a geometric series. However,
the convergence rate depends heavily on the eigenvalue spread of the Hessian matrix of $J(\boldsymbol{\theta})$. The
dependence of the convergence rate on the eigenvalues will be unraveled in Section 5.3. For large 
eigenvalue spreads, the rate of convergence can become extremely slow. On the other hand, the great 
advantage of the method lies in its low computational requirements. 

Finally, it has to be pointed out that we arrived at the scheme in Eq. (5.3) by searching all directions
via the unit Euclidean norm. However, there is nothing “sacred” about Euclidean norms. One can
employ other norms, such as the $\ell_1$ and quadratic $\boldsymbol{z}^T P \boldsymbol{z}$ norms, where $P$ is a positive definite matrix.
Under such choices, one will end up with alternative update iterations (see, e.g., [23]). We will return 
to this point in Chapter 6, when dealing with Newton’s iterative and coordinate descent minimization 
schemes. 


5.3 APPLICATION TO THE MEAN-SQUARE ERROR COST FUNCTION 

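Let us apply the gradient descent scheme to derive an iterative algorithm to minimize our familiar cost
function

$$J(\boldsymbol{\theta}) = \mathrm{E}\left[(\mathrm{y} - \boldsymbol{\theta}^T \mathbf{x})^2\right] = \sigma_y^2 - 2\boldsymbol{\theta}^T \boldsymbol{p} + \boldsymbol{\theta}^T \Sigma_x \boldsymbol{\theta}, \tag{5.4}$$

where $\Sigma_x = \mathrm{E}[\mathbf{x}\mathbf{x}^T]$ is the input covariance matrix and $\boldsymbol{p} = \mathrm{E}[\mathbf{x}\mathrm{y}]$ is the input-output cross-correlation
vector (Chapter 4). The respective cost function gradient with respect to $\boldsymbol{\theta}$ is easily seen to be (e.g.,
Appendix A)

$$\nabla J(\boldsymbol{\theta}) = 2\Sigma_x \boldsymbol{\theta} - 2\boldsymbol{p}. \tag{5.5}$$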

In this chapter, we will also adhere to zero mean jointly distributed input-output random variables, 
except if otherwise stated. Thus, the covariance and correlation matrices coincide. If this is not the 
case, the covariance in (5.5) is replaced by the correlation matrix. The treatment is focused on real data 
and we will point out differences with the complex-valued data case whenever needed. 

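Employing (5.5), the update recursion in (5.3) becomes

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu\left(\Sigma_x \boldsymbol{\theta}^{(i-1)} - \boldsymbol{p}\right) = \boldsymbol{\theta}^{(i-1)} + \mu\left(\boldsymbol{p} - \Sigma_x \boldsymbol{\theta}^{(i-1)}\right), \tag{5.6}$$

where the step size has been considered constant and it has also absorbed the factor 2. The more general
case of iteration dependent values of the step size will be discussed soon after. Our goal now becomes
that of searching for all values of $\mu$ that guarantee convergence. To this end, define

$$\boldsymbol{c}^{(i)} := \boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*, \tag{5.7}$$

where $\boldsymbol{\theta}_*$ is the (unique) optimal MSE solution that results by solving the respective normal equations
(Chapter 4),

$$\Sigma_x \boldsymbol{\theta}_* = \boldsymbol{p}. \tag{5.8}$$

Subtracting $\boldsymbol{\theta}_*$ from both sides of (5.6) and plugging in (5.7), we obtain

$$\boldsymbol{c}^{(i)} = \boldsymbol{c}^{(i-1)} + \mu\left(\boldsymbol{p} - \Sigma_x \boldsymbol{c}^{(i-1)} - \Sigma_x \boldsymbol{\theta}_*\right) = \boldsymbol{c}^{(i-1)} - \mu\Sigma_x \boldsymbol{c}^{(i-1)} = (I - \mu\Sigma_x)\boldsymbol{c}^{(i-1)}. \tag{5.9}$$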

Recall that $\Sigma_x$ is a symmetric positive definite matrix (Chapter 2); hence all its eigenvalues are positive,
and moreover (Appendix A.2) it can be written as

$$\Sigma_x = Q \Lambda Q^T, \tag{5.10}$$

where

$$\Lambda := \mathrm{diag}\{\lambda_1, \ldots, \lambda_l\} \quad \text{and} \quad Q := [\boldsymbol{q}_1, \boldsymbol{q}_2, \ldots, \boldsymbol{q}_l],$$

with $\lambda_j$, $\boldsymbol{q}_j$, $j = 1, 2, \ldots, l$, being the eigenvalues and the respective normalized (orthogonal)
eigenvectors of the covariance matrix,¹ represented by

$$\boldsymbol{q}_k^T \boldsymbol{q}_j = \delta_{kj}, \quad k, j = 1, 2, \ldots, l \implies Q^T = Q^{-1}.$$

That is, the matrix $Q$ is orthogonal. Plugging the factorization of $\Sigma_x$ into (5.9), we obtain

$$\boldsymbol{c}^{(i)} = Q(I - \mu\Lambda)Q^T \boldsymbol{c}^{(i-1)},$$

or

$$\boldsymbol{v}^{(i)} = (I - \mu\Lambda)\boldsymbol{v}^{(i-1)}, \tag{5.11}$$

where

$$\boldsymbol{v}^{(i)} := Q^T \boldsymbol{c}^{(i)}, \quad i = 1, 2, \ldots. \tag{5.12}$$

The previously used “trick” is a standard one and its aim is to “decouple” the various components of
$\boldsymbol{\theta}^{(i)}$ in (5.6). Indeed, each one of the components $v^{(i)}(j)$, $j = 1, 2, \ldots, l$, of $\boldsymbol{v}^{(i)}$ follows an iteration
path, which is independent of the rest of the components; in other words,

$$v^{(i)}(j) = (1 - \mu\lambda_j)v^{(i-1)}(j) = (1 - \mu\lambda_j)^2 v^{(i-2)}(j) = \cdots = (1 - \mu\lambda_j)^i v^{(0)}(j), \tag{5.13}$$

where $v^{(0)}(j)$ is the $j$th component of $\boldsymbol{v}^{(0)}$, corresponding to the initial vector. It is readily seen that if

$$|1 - \mu\lambda_j| < 1 \iff -1 < 1 - \mu\lambda_j < 1, \quad j = 1, 2, \ldots, l, \tag{5.14}$$

the geometric series tends to zero and

$$\boldsymbol{v}^{(i)} \to \boldsymbol{0} \implies Q^T(\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*) \to \boldsymbol{0} \implies \boldsymbol{\theta}^{(i)} \to \boldsymbol{\theta}_*. \tag{5.15}$$

Note that (5.14) is equivalent to

$$0 < \mu < 2/\lambda_{\max}: \quad \text{condition for convergence}, \tag{5.16}$$

where $\lambda_{\max}$ denotes the maximum eigenvalue of $\Sigma_x$.
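
The decoupling can be checked numerically. The sketch below assumes a $2 \times 2$ covariance matrix whose eigenvalues ($1$ and $0.1$) and eigenvectors are known in closed form, runs the error recursion (5.9), and compares the transformed error with the closed form (5.13). The matrix, the initial error, and the step size are illustrative assumptions.

```python
import math

# Assumed example: Sigma = [[0.55, 0.45], [0.45, 0.55]] has eigenvalues 1.0 and 0.1
# with orthonormal eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
SIGMA = [[0.55, 0.45], [0.45, 0.55]]
EIGVALS = [1.0, 0.1]
S = 1.0 / math.sqrt(2.0)
Q_COLS = [[S, S], [S, -S]]  # Q_COLS[j] holds the eigenvector q_j


def step_error(c, mu):
    """One step of (5.9): c <- (I - mu * Sigma) c."""
    return [c[r] - mu * sum(SIGMA[r][k] * c[k] for k in range(2)) for r in range(2)]


def to_modes(c):
    """v = Q^T c, as in (5.12)."""
    return [sum(q[k] * c[k] for k in range(2)) for q in Q_COLS]


def is_convergent(mu):
    """Condition (5.16): 0 < mu < 2 / lambda_max."""
    return 0.0 < mu < 2.0 / max(EIGVALS)


mu, c0, num_iters = 0.5, [1.0, -0.4], 20
c = c0
for _ in range(num_iters):
    c = step_error(c, mu)

v0 = to_modes(c0)
v_closed_form = [(1.0 - mu * lam) ** num_iters * v for lam, v in zip(EIGVALS, v0)]
v_iterated = to_modes(c)  # agrees with v_closed_form up to rounding
```

Each entry of `v_iterated` depends only on its own eigenvalue, which is exactly what (5.13) states.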

¹ In contrast to other chapters, we denote eigenvectors with $\boldsymbol{q}$ and not as $\boldsymbol{u}$, since at some places the latter is used to denote the
input random vector.

FIGURE 5.6

Convergence curve for one of the components of the transformed error vector. Note that the curve is of an
approximate exponentially decreasing type.

Time constant: Fig. 5.6 shows a typical sketch of the evolution of $v^{(i)}(j)$ as a function of the
iteration steps for the case $0 < 1 - \mu\lambda_j < 1$. Assume that the envelope, denoted by the red line, is
(approximately) of an exponential form, $f(t) = \exp(-t/\tau_j)$. Plug into $f(t)$, as the values corresponding
at the time instants $t = iT$ and $t = (i-1)T$, the values of $v^{(i)}(j)$, $v^{(i-1)}(j)$ from (5.13); then the time
constant results as

$$\tau_j = \frac{-1}{\ln(1 - \mu\lambda_j)},$$

assuming that the sampling time between two successive iterations is $T = 1$. For small values of $\mu$, we
can write

$$\tau_j \approx \frac{1}{\mu\lambda_j}, \quad \text{for } \mu \ll 1.$$

That is, the slowest rate of convergence is associated with the component that corresponds to the
smallest eigenvalue. However, this is only true for small enough values of $\mu$. For the more general case,
this may not be true. Recall that the rate of convergence depends on the value of the term $1 - \mu\lambda_j$. This
is also known as the $j$th mode. Its value depends not only on $\lambda_j$ but also on $\mu$. Let us consider as an
example the case of $\mu$ taking a value very close to the maximum allowable one, $\mu \simeq 2/\lambda_{\max}$. Then
the mode corresponding to the maximum eigenvalue will have an absolute value very close to one.
On the other hand, the time constant of the mode corresponding to the minimum eigenvalue will be
controlled by the value of $|1 - 2\lambda_{\min}/\lambda_{\max}|$, which can be much smaller than one. In such a case, the
mode corresponding to the maximum eigenvalue exhibits slower convergence.
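
A short numerical comparison of the exact time constant with its small step size approximation may help here. The values of $\mu$ and of the eigenvalues below are assumptions chosen for illustration.

```python
import math


def time_constant(mu, lam):
    """Exact time constant of a mode with 0 < 1 - mu * lam < 1 (sampling period T = 1)."""
    return -1.0 / math.log(1.0 - mu * lam)


def time_constant_small_mu(mu, lam):
    """Small step size approximation 1 / (mu * lam)."""
    return 1.0 / (mu * lam)


mu = 0.01
for lam in (1.0, 0.1):
    print(lam, time_constant(mu, lam), time_constant_small_mu(mu, lam))
```

By construction, after `time_constant(mu, lam)` iterations the mode $(1 - \mu\lambda)^i$ has decayed by a factor of $e^{-1}$, and the smaller eigenvalue yields the larger time constant.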

To obtain the optimum value for the step size, one has to select its value in such a way that the
resulting maximum absolute mode value is minimized. This is a min/max task,

$$\mu_o = \arg\min_{\mu} \max_{j} |1 - \mu\lambda_j|, \quad \text{s.t. } |1 - \mu\lambda_j| < 1, \quad j = 1, 2, \ldots, l.$$

The task can be solved easily graphically. Fig. 5.7 shows the absolute values of the modes
(corresponding to the maximum, minimum, and an intermediate eigenvalue). The (absolute) values of the
modes initially decrease as $\mu$ increases, and then they start increasing. Observe that the optimal value
results at the point where the curves for the maximum and minimum eigenvalues intersect. Indeed, this
corresponds to the minimum-maximum value. Moving $\mu$ away from $\mu_o$, the maximum mode value
increases; increasing $\mu_o$, the mode corresponding to the maximum eigenvalue becomes larger, and
decreasing it, the mode corresponding to the minimum eigenvalue is increased. At the intersection we
have

$$1 - \mu_o\lambda_{\min} = -(1 - \mu_o\lambda_{\max}),$$

which results in

$$\mu_o = \frac{2}{\lambda_{\max} + \lambda_{\min}}. \tag{5.17}$$

FIGURE 5.7

For each mode, increasing the value of the step size, the time constant starts decreasing and then after a point starts
increasing. The full black line corresponds to the maximum eigenvalue, the red one to the minimum, and the dotted
curve to an intermediate eigenvalue. The overall optimal, $\mu_o$, corresponds to the value where the red and full black
curves intersect.

At the optimal value, $\mu_o$, there are two slowest modes; one corresponding to $\lambda_{\min}$ (i.e., $1 - \mu_o\lambda_{\min}$)
and another one corresponding to $\lambda_{\max}$ (i.e., $1 - \mu_o\lambda_{\max}$). They have equal magnitudes but opposite
signs, and they are given by

$$\pm\frac{\rho - 1}{\rho + 1}, \quad \text{where} \quad \rho := \frac{\lambda_{\max}}{\lambda_{\min}}.$$

In other words, the convergence rate depends on the eigenvalues spread of the covariance matrix.
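
The min/max argument can be verified numerically with a small sketch. The three eigenvalues below are assumptions chosen for illustration; the helper names are hypothetical.

```python
def optimal_step(eigvals):
    """Step size of (5.17): 2 / (lambda_max + lambda_min)."""
    return 2.0 / (max(eigvals) + min(eigvals))


def worst_mode(mu, eigvals):
    """Largest |1 - mu * lambda_j|, which governs the slowest decay."""
    return max(abs(1.0 - mu * lam) for lam in eigvals)


eigvals = [1.0, 0.4, 0.1]          # illustrative eigenvalues
mu_o = optimal_step(eigvals)
rho = max(eigvals) / min(eigvals)  # eigenvalue spread
print(mu_o, worst_mode(mu_o, eigvals), (rho - 1.0) / (rho + 1.0))
```

Evaluating `worst_mode` slightly below or slightly above `mu_o` gives a larger value in both cases, in line with the graphical argument of Fig. 5.7.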
Parameter error vector convergence: From the definitions in (5.7) and (5.12), we get

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}_* + Q\boldsymbol{v}^{(i)} = \boldsymbol{\theta}_* + [\boldsymbol{q}_1, \ldots, \boldsymbol{q}_l]\,[v^{(i)}(1), \ldots, v^{(i)}(l)]^T = \boldsymbol{\theta}_* + \sum_{k=1}^{l} \boldsymbol{q}_k v^{(i)}(k), \tag{5.18}$$

or

$$\theta^{(i)}(j) = \theta_*(j) + \sum_{k=1}^{l} q_k(j)\, v^{(0)}(k)\,(1 - \mu\lambda_k)^i, \quad j = 1, 2, \ldots, l. \tag{5.19}$$

In other words, the components of $\boldsymbol{\theta}^{(i)}$ converge to the respective components of the optimum vector
$\boldsymbol{\theta}_*$ as a weighted average of powers of $1 - \mu\lambda_k$, i.e., $(1 - \mu\lambda_k)^i$. Computing the respective time
constant in closed form is not possible; however, we can state lower and upper bounds. The lower bound
corresponds to the time constant of the fastest converging mode and the upper bound to the slowest of
the modes. For small values of $\mu \ll 1$, we can write

$$\frac{1}{\mu\lambda_{\max}} \le \tau \le \frac{1}{\mu\lambda_{\min}}. \tag{5.20}$$

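The learning curve: We now turn our focus to the mean-square error (MSE). Recall from (4.8) that

$$J(\boldsymbol{\theta}^{(i)}) = J(\boldsymbol{\theta}_*) + (\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*)^T \Sigma_x (\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*), \tag{5.21}$$

or, mobilizing (5.18) and (5.10) and taking into consideration the orthonormality of the eigenvectors,
we obtain

$$J(\boldsymbol{\theta}^{(i)}) = J(\boldsymbol{\theta}_*) + \sum_{j=1}^{l} \lambda_j |v^{(i)}(j)|^2 \implies J(\boldsymbol{\theta}^{(i)}) = J(\boldsymbol{\theta}_*) + \sum_{j=1}^{l} \lambda_j (1 - \mu\lambda_j)^{2i} |v^{(0)}(j)|^2, \tag{5.22}$$

which converges to the minimum value $J(\boldsymbol{\theta}_*)$ asymptotically. Moreover, observe that this convergence
is monotonic, because $\lambda_j(1 - \mu\lambda_j)^2$ is positive. Following similar arguments as before, the respective
time constants for each one of the modes are now

$$\tau_j^{mse} = \frac{-1}{2\ln(1 - \mu\lambda_j)} \approx \frac{1}{2\mu\lambda_j}. \tag{5.23}$$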
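
The learning curve in (5.22) can be cross-checked against a direct iteration of the error recursion. The sketch below assumes a diagonal covariance matrix, so that $Q = I$ and $\boldsymbol{v}^{(i)} = \boldsymbol{c}^{(i)}$; the eigenvalues, step size, and initial error are illustrative assumptions.

```python
# Illustrative setup (assumed): diagonal covariance, so Q = I and v = c.
EIGVALS = [1.0, 0.1]
MU = 0.5
C0 = [0.8, -1.5]  # initial error theta^(0) - theta_*


def excess_mse_iterated(num_iters):
    """J(theta^(i)) - J(theta_*) = c^T Sigma c, with c updated by (5.9)."""
    c = list(C0)
    for _ in range(num_iters):
        c = [(1.0 - MU * lam) * ck for lam, ck in zip(EIGVALS, c)]
    return sum(lam * ck * ck for lam, ck in zip(EIGVALS, c))


def excess_mse_closed_form(num_iters):
    """Sum in (5.22): lambda_j * (1 - mu * lambda_j)^(2 i) * |v^(0)(j)|^2."""
    return sum(lam * (1.0 - MU * lam) ** (2 * num_iters) * c0 * c0
               for lam, c0 in zip(EIGVALS, C0))


curve = [excess_mse_closed_form(i) for i in range(30)]
```

The list `curve` decreases monotonically, and each entry matches the value obtained by iterating, up to rounding.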


Example 5.1. The aim of the example is to demonstrate what we have said so far, concerning the
convergence issues of the gradient descent scheme in (5.6). The cross-correlation vector was chosen to be

$$\boldsymbol{p} = [0.05, 0.03]^T,$$

and we consider two different covariance matrices,

$$\Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Note that for the case of $\Sigma_2$, both eigenvalues are equal to 1, and for $\Sigma_1$ they are $\lambda_1 = 1$ and $\lambda_2 = 0.1$
(for diagonal matrices the eigenvalues are equal to the diagonal elements of the matrix).
FIGURE 5.8

The black curve corresponds to the optimal value $\mu = \mu_o$ and the gray one to $\mu = \mu_o/2$, for the case of an input
covariance matrix with unequal eigenvalues.


Fig. 5.8 shows the error curves for two values of $\mu$ for the case of $\Sigma_1$; the gray one corresponds
to the optimum value ($\mu_o = 1.81$) and the red one to $\mu = \mu_o/2 = 0.9$. Observe the faster convergence
towards zero that is achieved by the optimal value. Note that it may happen, as is the case in Fig. 5.8,
that initially the convergence for some $\mu \neq \mu_o$ will be faster compared to $\mu_o$. What the theory
guarantees is that, eventually, the curve corresponding to the optimal will tend to zero faster than for any other
value of $\mu$. Fig. 5.9 shows the respective trajectories of the successive estimates in the two-dimensional
space, together with the isovalue curves; the latter are ellipses, as we can readily deduce if we look
carefully at the form of the quadratic cost function written as in (5.21). Compare the zig-zag path, which
corresponds to the larger value of $\mu = 1.81$, with the smoother one, obtained for the smaller step size
$\mu = 0.9$.

For comparison reasons, to demonstrate the dependence of the convergence speed on the eigenvalues
spread, Fig. 5.10 shows the error curves using the same step size, $\mu = 1.81$, for both cases, $\Sigma_1$ and
$\Sigma_2$. Observe that large eigenvalues spread of the input covariance matrix slows down the convergence
rate. Note that if the eigenvalues of the covariance matrix are equal to, say, $\lambda$, the isovalue curves are
circles; the optimal step size in this case is $\mu = 1/\lambda$ and convergence is achieved in only one step
(Fig. 5.11).
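
The setting of this example can be reproduced with a short script. The values of $\boldsymbol{p}$, $\Sigma_1$, and $\Sigma_2$ are those given above; the zero starting point and the number of iterations are assumptions, since the text does not state them, so the resulting curves need not match the figures point by point.

```python
import math

P = [0.05, 0.03]                    # cross-correlation vector of Example 5.1
SIGMA_1 = [[1.0, 0.0], [0.0, 0.1]]  # unequal eigenvalues
SIGMA_2 = [[1.0, 0.0], [0.0, 1.0]]  # equal eigenvalues


def run(sigma, mu, num_iters, theta0=(0.0, 0.0)):
    """Apply (5.6) and return the error norms ||theta^(i) - theta_*||, i = 0, 1, ..."""
    # Sigma is diagonal here, so the normal equations (5.8) are solved entrywise.
    theta_opt = [P[k] / sigma[k][k] for k in range(2)]
    theta = list(theta0)  # assumed starting point
    errors = []
    for _ in range(num_iters + 1):
        errors.append(math.dist(theta, theta_opt))
        residual = [P[r] - sum(sigma[r][k] * theta[k] for k in range(2)) for r in range(2)]
        theta = [theta[r] + mu * residual[r] for r in range(2)]
    return errors


err_opt = run(SIGMA_1, mu=2.0 / 1.1, num_iters=60)   # mu_o from (5.17), about 1.81
err_half = run(SIGMA_1, mu=1.0 / 1.1, num_iters=60)  # mu_o / 2
err_equal = run(SIGMA_2, mu=1.0, num_iters=2)        # mu = 1 / lambda, equal eigenvalues
```

With equal eigenvalues and $\mu = 1/\lambda$, the error is zero after the first update, while for $\Sigma_1$ the optimal step size ends up with the smaller error of the two runs.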

TIME-VARYING STEP SIZES 

The previous analysis cannot be carried out for the case of an iteration dependent step size. It can be
shown (Problem 5.2) that in this case, the gradient descent algorithm converges if

• $\mu_i \to 0$, as $i \to \infty$,

• $\sum_{i=1}^{\infty} \mu_i = \infty$.

A typical example of sequences which comply with both conditions are those that satisfy the following:

$$\sum_{i=1}^{\infty} \mu_i^2 < \infty, \quad \sum_{i=1}^{\infty} \mu_i = \infty, \tag{5.24}$$

as, for example, the sequence

$$\mu_i = \frac{1}{i}.$$

FIGURE 5.9

The trajectories of the successive estimates (dots) obtained by the gradient descent algorithm for (A) the larger value
of $\mu = 1.81$ and (B) the smaller value of $\mu = 0.9$. In (B), the trajectory toward the minimum is smooth. In contrast,
in (A), the trajectory consists of zig-zags.
FIGURE 5.10

For the same value of $\mu = 1.81$, the error curves for the case of unequal eigenvalues ($\lambda_1 = 1$ and $\lambda_2 = 0.1$) (gray)
and for equal eigenvalues ($\lambda_1 = \lambda_2 = 1$). For the latter case, the isovalue curves are circles; if the optimal value
$\mu_o = 1$ is used, the algorithm converges in one step. This is demonstrated in Fig. 5.11.

FIGURE 5.11

When the eigenvalues of the covariance matrix are all equal to a value, $\lambda$, the use of the optimal $\mu_o = 1/\lambda$ achieves
convergence in one step.


Note that the two (sufficient) conditions require that the sequence tends to zero, yet its infinite sum 
diverges. We will meet this pair of conditions in various parts of this book. The previous conditions 
state that the step size has to become smaller and smaller as iterations progress, but this should not
take place in a very aggressive manner, so that the algorithm is left to be active for a sufficient number 
of iterations to learn the solution. If the step size tends to zero very fast, then updates are practically 
frozen after a few iterations, without the algorithm having acquired enough information to get close to 
the solution. 
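
The "freezing" effect can be illustrated on a single mode. The sketch below assumes a scalar error recursion with eigenvalue $\lambda = 0.5$ and compares $\mu_i = 1/i$, whose sum diverges, with $\mu_i = 1/i^2$, whose sum is finite; both sequences tend to zero. All numerical values are assumptions chosen for illustration.

```python
def final_error(step_rule, lam=0.5, c0=1.0, num_iters=10000):
    """Scalar version of (5.9) with a varying step size: c <- (1 - mu_i * lam) * c."""
    c = c0
    for i in range(1, num_iters + 1):
        c *= 1.0 - step_rule(i) * lam
    return c


slow_decay = final_error(lambda i: 1.0 / i)       # the sum of mu_i diverges
fast_decay = final_error(lambda i: 1.0 / i ** 2)  # the sum of mu_i is finite
```

With $\mu_i = 1/i$ the error keeps shrinking as iterations accumulate, whereas with $\mu_i = 1/i^2$ it settles at a nonzero value, because the updates become negligible before the solution has been reached.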

5.3.1 THE COMPLEX-VALUED CASE 

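In Section 4.4.2 we stated that a function $f: \mathbb{C}^l \to \mathbb{R}$ is not differentiable with respect to its complex
argument. To deal with such cases, the Wirtinger calculus was introduced. In this section, we use this
mathematically convenient tool to derive the corresponding steepest descent direction.

To this end, we again employ a first-order Taylor series approximation [22]. Let

$$\boldsymbol{\theta} = \boldsymbol{\theta}_r + j\boldsymbol{\theta}_i.$$

Then the cost function

$$J(\boldsymbol{\theta}): \mathbb{C}^l \to [0, +\infty)$$

is approximated as

$$J(\boldsymbol{\theta} + \Delta\boldsymbol{\theta}) = J(\boldsymbol{\theta}_r + \Delta\boldsymbol{\theta}_r, \boldsymbol{\theta}_i + \Delta\boldsymbol{\theta}_i) = J(\boldsymbol{\theta}_r, \boldsymbol{\theta}_i) + \Delta\boldsymbol{\theta}_r^T \nabla_r J(\boldsymbol{\theta}_r, \boldsymbol{\theta}_i) + \Delta\boldsymbol{\theta}_i^T \nabla_i J(\boldsymbol{\theta}_r, \boldsymbol{\theta}_i), \tag{5.25}$$

where $\nabla_r$ ($\nabla_i$) denotes the gradient with respect to $\boldsymbol{\theta}_r$ ($\boldsymbol{\theta}_i$). Taking into account that

$$\Delta\boldsymbol{\theta}_r = \frac{\Delta\boldsymbol{\theta} + \Delta\boldsymbol{\theta}^*}{2}, \quad \Delta\boldsymbol{\theta}_i = \frac{\Delta\boldsymbol{\theta} - \Delta\boldsymbol{\theta}^*}{2j},$$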

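it is easy to show (Problem 5.3) that

$$J(\boldsymbol{\theta} + \Delta\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + 2\,\mathrm{Re}\left\{\Delta\boldsymbol{\theta}^H \nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta})\right\}, \tag{5.26}$$

where $\nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta})$ is the CW-derivative, defined in Section 4.4.2 as

$$\nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta}) = \frac{1}{2}\left(\nabla_r J(\boldsymbol{\theta}) + j\nabla_i J(\boldsymbol{\theta})\right);$$

the factor 2 in (5.26) compensates for the factor 1/2 in this definition. Looking carefully at (5.26), it is
straightforward to observe that the direction

$$\Delta\boldsymbol{\theta} = -\mu\nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta})$$

makes the updated cost equal to

$$J(\boldsymbol{\theta} + \Delta\boldsymbol{\theta}) = J(\boldsymbol{\theta}) - 2\mu\|\nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta})\|^2,$$

which guarantees that $J(\boldsymbol{\theta} + \Delta\boldsymbol{\theta}) < J(\boldsymbol{\theta})$; it is straightforward to see, by taking into account the
definition of an inner product, that the above search direction is the one of the largest decrease. Thus, the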
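counterpart of (5.3) becomes

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu_i \nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta}^{(i-1)}): \quad \text{complex gradient descent scheme}. \tag{5.27}$$

For the MSE cost function and for the linear estimation model, we get

$$J(\boldsymbol{\theta}) = \mathrm{E}\left[(\mathrm{y} - \boldsymbol{\theta}^H \mathbf{x})(\mathrm{y} - \boldsymbol{\theta}^H \mathbf{x})^*\right] = \sigma_y^2 + \boldsymbol{\theta}^H \Sigma_x \boldsymbol{\theta} - \boldsymbol{\theta}^H \boldsymbol{p} - \boldsymbol{p}^H \boldsymbol{\theta},$$

and taking the gradient with respect to $\boldsymbol{\theta}^*$, by treating $\boldsymbol{\theta}$ as a constant (Section 4.4.2), we obtain

$$\nabla_{\boldsymbol{\theta}^*} J(\boldsymbol{\theta}) = \Sigma_x \boldsymbol{\theta} - \boldsymbol{p},$$

and the respective gradient descent iteration is the same as in (5.6).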
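
A scalar ($l = 1$) sketch of the complex-valued iteration, using Python's built-in complex numbers, is given below. The values of $\Sigma_x$, $p$, the step size, and the starting point are assumptions chosen for illustration.

```python
# Assumed scalar example (l = 1): Sigma_x = E[|x|^2] = 2 and p = E[x y*] = 1 - 3j.
SIGMA_X = 2.0
P = 1.0 - 3.0j


def cw_gradient(theta):
    """Gradient with respect to theta*: Sigma_x * theta - p."""
    return SIGMA_X * theta - P


def excess_cost(theta, theta_opt):
    """J(theta) - J(theta_opt) = Sigma_x * |theta - theta_opt|^2 in this scalar case."""
    return SIGMA_X * abs(theta - theta_opt) ** 2


theta_opt = P / SIGMA_X  # solves Sigma_x * theta = p
theta = 0.0 + 0.0j
mu = 0.2                 # inside 0 < mu < 2 / Sigma_x
costs = [excess_cost(theta, theta_opt)]
for _ in range(50):
    theta = theta - mu * cw_gradient(theta)  # update (5.27) with a constant step size
    costs.append(excess_cost(theta, theta_opt))
```

Both the real and the imaginary part of the estimate are driven to those of `theta_opt` by the same recursion, and the excess cost decreases at every step.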


5.4 STOCHASTIC APPROXIMATION 

Solving the normal equations (Eq. (5.8)), as well as using the gradient descent iterative scheme (for
the case of the MSE), requires access to the second-order statistics of the involved variables.
However, in most cases, these are not known and have to be approximated using a set of observations.
In this section, we turn our attention to algorithms that can learn the statistics iteratively via
the training set. The origins of such techniques are traced back to 1951, when Robbins and Monro
introduced the method of stochastic approximation [79], or the Robbins-Monro algorithm.

Let us consider the case of a function that is defined in terms of the expected value of another one,
namely,

$$f(\boldsymbol{\theta}) = \mathrm{E}\left[\phi(\boldsymbol{\theta}, \boldsymbol{\eta})\right], \quad \boldsymbol{\theta} \in \mathbb{R}^l,$$

where $\boldsymbol{\eta}$ is a random vector of unknown distribution. The goal is to compute a root of $f(\boldsymbol{\theta})$. If the
distribution was known, the expectation could be computed, at least in principle, and one could use any
root-finding algorithm to compute the roots. The problem emerges when the statistics are unknown;
hence the exact form of $f(\boldsymbol{\theta})$ is not known. All one has at her/his disposal is a sequence of i.i.d.
observations $\boldsymbol{\eta}_0, \boldsymbol{\eta}_1, \ldots$. Robbins and Monro proved that the following algorithm,²

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mu_n \phi(\boldsymbol{\theta}_{n-1}, \boldsymbol{\eta}_n): \quad \text{Robbins-Monro scheme}, \tag{5.28}$$

starting from an arbitrary initial condition, $\boldsymbol{\theta}_{-1}$, converges³ to a root of $f(\boldsymbol{\theta})$, under some general
conditions and provided that (Problem 5.4)

$$\sum_{n} \mu_n^2 < \infty, \quad \sum_{n} \mu_n \to \infty: \quad \text{convergence conditions}. \tag{5.29}$$

² The original paper dealt with scalar variables only and the method was later extended to more general cases; see [96] for a
related discussion.

³ Convergence here is meant to be in probability (see Section 2.6).

In other words, in the iteration (5.28), we get rid of the expectation operation and use the value of 
$\phi(\cdot, \cdot)$, which is computed using the current observations and the currently available estimate. That is,
the algorithm learns both the statistics as well as the root; two in one! The same comments made for 
the convergence conditions, met in the iteration dependent step size case in Section 5.3, are valid here 
as well. 
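
A minimal instance of the Robbins-Monro scheme is the estimation of a mean. In the sketch below, the function $\phi(\theta, \eta) = \theta - \eta$, the step sizes $\mu_n = 1/(n+1)$, and the Gaussian observations are assumptions made for illustration; the root of $f(\theta) = \mathrm{E}[\theta - \eta]$ is then the mean of $\eta$.

```python
import random


def robbins_monro_mean(samples):
    """Root of f(theta) = E[theta - eta] via (5.28), with phi(theta, eta) = theta - eta
    and step sizes mu_n = 1 / (n + 1), n = 0, 1, ..."""
    theta = 0.0  # arbitrary initial condition
    for n, eta in enumerate(samples):
        mu_n = 1.0 / (n + 1)
        theta = theta - mu_n * (theta - eta)
    return theta


rng = random.Random(0)  # fixed seed, only for reproducibility
samples = [rng.gauss(3.0, 2.0) for _ in range(5000)]
estimate = robbins_monro_mean(samples)
sample_mean = sum(samples) / len(samples)
```

With this particular step size sequence the recursion reduces to the running sample mean, so `estimate` and `sample_mean` coincide up to rounding; the distribution of the observations is never used explicitly.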

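In the context of optimizing a general differentiable cost function of the form

$$J(\boldsymbol{\theta}) = \mathrm{E}\left[\mathcal{L}(\boldsymbol{\theta}, \mathrm{y}, \mathbf{x})\right], \tag{5.30}$$

the Robbins-Monro scheme can be mobilized to find a root of the respective gradient, i.e.,

$$\nabla J(\boldsymbol{\theta}) = \mathrm{E}\left[\nabla\mathcal{L}(\boldsymbol{\theta}, \mathrm{y}, \mathbf{x})\right],$$

where the expectation is with respect to the pair $(\mathrm{y}, \mathbf{x})$. As we have seen in Chapter 3, such cost
functions are also known as the expected risk or the expected loss in machine learning terminology.
Given the sequence of observations $(y_n, \boldsymbol{x}_n)$, $n = 0, 1, \ldots$, the recursion in (5.28) now becomes

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mu_n \nabla\mathcal{L}(\boldsymbol{\theta}_{n-1}, y_n, \boldsymbol{x}_n). \tag{5.31}$$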
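
The recursion in (5.31) can be sketched for the squared error loss $\mathcal{L}(\boldsymbol{\theta}, y, \boldsymbol{x}) = (y - \boldsymbol{\theta}^T \boldsymbol{x})^2$. The data model, the parameter vector `THETA_TRUE`, the noise level, and the step size sequence $\mu_n = 0.5/(n+1)$ are all hypothetical choices made for illustration.

```python
import random

THETA_TRUE = [1.0, -2.0]  # hypothetical "unknown" parameter, used only to generate data


def sgd_squared_loss(data, theta0):
    """Recursion (5.31) for the loss (y - theta^T x)^2, with mu_n = 0.5 / (n + 1)."""
    theta = list(theta0)
    for n, (y, x) in enumerate(data):
        mu_n = 0.5 / (n + 1)
        err = y - sum(t * xk for t, xk in zip(theta, x))
        grad = [-2.0 * err * xk for xk in x]  # gradient of the instantaneous loss
        theta = [t - mu_n * g for t, g in zip(theta, grad)]
    return theta


def distance(a, b):
    """Euclidean distance between two vectors given as lists."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b)) ** 0.5


rng = random.Random(1)  # fixed seed, only for reproducibility
data = []
for _ in range(20000):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    y = sum(t * xk for t, xk in zip(THETA_TRUE, x)) + rng.gauss(0.0, 0.1)
    data.append((y, x))

theta0 = [0.0, 0.0]
theta_hat = sgd_squared_loss(data, theta0)
```

Each update uses a single observation pair and no expectation is ever computed. The constant in front of $1/(n+1)$ affects how quickly the estimate approaches the minimizer and would normally need tuning.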


Let us now assume, for simplicity, that the expected risk has a unique minimum, $\theta_*$. Then, according
to the Robbins-Monro theorem and using an appropriate sequence $\mu_n$, $\theta_n$ will converge to $\theta_*$. However,
although this information is important, it is not by itself enough. In practice, one has to cease
iterations after a finite number of steps. Hence, one has to know something more concerning the rate
of convergence of such a scheme. To this end, two quantities are of interest, namely, the mean and the
covariance matrix of the estimator at iteration n, i.e.,

$$E[\theta_n], \quad \mathrm{Cov}(\theta_n).$$


It can be shown (see [67]) that if $\mu_n = O(1/n)$^4 and assuming that iterations have brought the estimate
close to the optimal value, then

$$E[\theta_n] = \theta_* + \frac{1}{n} c \tag{5.32}$$

and

$$\mathrm{Cov}(\theta_n) = \frac{1}{n} V + O(1/n^2), \tag{5.33}$$


where c and V are constants that depend on the form of the expected risk. The above formulas have
also been derived under some further assumptions concerning the eigenvalues of the Hessian matrix
of the expected risk.^5 It is worth pointing out here that, in general, the convergence analysis of even
simple algorithms is a mathematically tough task, and it is common to carry it out under a number of
assumptions. What is important to keep from (5.32) and (5.33) is that both the mean and the variances
of the components follow an $O(1/n)$ pattern (in (5.33), $O(1/n)$ is the prevailing pattern, since the
$O(1/n^2)$ dependence goes to zero much faster). Furthermore, these formulas indicate that the parameter
vector estimate fluctuates around the optimal value. Indeed, in contrast to Eq. (5.15), where $\theta^{(i)}$
converges to the optimal value, here it is the corresponding expected value that converges. The spread
around the expected value is controlled by the respective covariance matrix.

4 The symbol O denotes order of magnitude.

5 The proof is a bit technical and the interested reader can look at the provided reference.

The previously reported fluctuation depends on the choice of the sequence $\mu_n$, being smaller for
smaller values of the step size sequence. However, $\mu_n$ cannot be made to decrease very fast due to
the two convergence conditions, as discussed before. This is the price one pays for using the noisy
version of the gradient and it is the reason that such schemes suffer from relatively slow convergence
rates. However, this does not mean that such schemes are, necessarily, the poor relatives of other more
“elaborate” algorithms. As we will discuss in Chapter 8, their low complexity requirements make this
algorithmic family the preferred choice in a number of practical applications.

APPLICATION TO THE MSE LINEAR ESTIMATION 

Let us apply the Robbins-Monro algorithm to solve for the optimal MSE linear estimator if the covariance
matrix and the cross-correlation vector are unknown. We know that the solution corresponds to
the root of the gradient of the cost function, which can be written in the form (recall the orthogonality
theorem from Chapter 3 and Eq. (5.8))

$$\Sigma_x \theta - p = E[x(x^T \theta - y)] = 0.$$

Given the training sequence of observations, $(y_n, x_n)$, which are assumed to be i.i.d. drawn from the
joint distribution of $(y, x)$, the Robbins-Monro algorithm becomes

$$\theta_n = \theta_{n-1} + \mu_n x_n (y_n - x_n^T \theta_{n-1}), \tag{5.34}$$

which converges to the optimal MSE solution provided that the two conditions in (5.29) are satisfied.
Compare (5.34) with (5.6). Taking into account the definitions, $\Sigma_x = E[x x^T]$, $p = E[x y]$, the former
equation results from the latter one by dropping out the expectation operations and using an
iteration-dependent step size. Observe that the iterations in (5.34) coincide with time updates; time has now
explicitly entered into the scene. This prompts us to start thinking about modifying such schemes
appropriately to track time-varying environments. Algorithms such as the one in (5.34), which result from
the generic gradient descent formulation by replacing the expectation by the respective instantaneous
observations, are also known as stochastic gradient descent schemes.

Remarks 5.1. 

• All the algorithms to be derived next can also be applied to nonlinear estimation/filtering tasks of
the form

$$\hat{y} = \sum_{k=1}^{l} \theta_k \phi_k(x) = \theta^T \phi,$$

and the place of x is taken by $\phi$, where

$$\phi = [\phi_1(x), \ldots, \phi_l(x)]^T.$$

FIGURE 5.12

The red line corresponds to the true value of the unknown parameter. The black curve corresponds to the average
over 200 realizations of the experiment. Observe that the mean value converges to the true value. The bars
correspond to the respective standard deviation, which keeps decreasing as n grows.


Example 5.2. The aim of this example is to demonstrate the pair of Eqs. (5.32) and (5.33), which 
characterize the convergence properties of the stochastic gradient scheme. 

Data samples were first generated according to the regression model

$$y_n = \theta^T x_n + \eta_n,$$

where $\theta \in \mathbb{R}^2$ was randomly chosen and then fixed. The elements of $x_n$ were i.i.d. generated via a
normal distribution $\mathcal{N}(0, 1)$ and $\eta_n$ are samples of a white noise sequence with variance equal to
$\sigma^2 = 0.1$. Then the observations $(y_n, x_n)$ were used in the recursive scheme in (5.34) to obtain an
estimate of $\theta$. The experiment was repeated 200 times and the mean and variance of the obtained
estimates were computed for each iteration step. Fig. 5.12 shows the resulting curve for one of the
parameters (the trend for the other one being similar). Observe that the mean values of the estimates
tend to the true value, corresponding to the red line, and the standard deviation keeps decreasing as n
grows. The step size was chosen equal to $\mu_n = 1/n$.
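
An experiment of this kind can be sketched as follows. This is an illustrative reconstruction, not the code used to produce Fig. 5.12: the seed, the number of iterations (500), the range from which $\theta$ is drawn, the zero initialization, and the indexing $\mu_n = 1/(n+1)$ for $n = 0, 1, \ldots$ (so that the first step equals 1) are assumptions, and `sgd_path` is a hypothetical helper name.

```python
import random
import statistics

def sgd_path(theta_true, n_iter, noise_std, rng):
    """One realization of the recursion (5.34); returns the path of the first coefficient."""
    theta = [0.0, 0.0]
    path = []
    for n in range(n_iter):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        y = theta_true[0] * x[0] + theta_true[1] * x[1] + rng.gauss(0.0, noise_std)
        err = y - (theta[0] * x[0] + theta[1] * x[1])
        mu_n = 1.0 / (n + 1)   # assumed indexing of the 1/n step size
        theta = [theta[0] + mu_n * err * x[0], theta[1] + mu_n * err * x[1]]
        path.append(theta[0])
    return path

rng = random.Random(1)         # fixed seed, for reproducibility only
theta_true = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
n_iter, n_runs = 500, 200
paths = [sgd_path(theta_true, n_iter, 0.1 ** 0.5, rng) for _ in range(n_runs)]

# Mean and standard deviation over the realizations, for every iteration step.
mean_curve = [statistics.mean([p[n] for p in paths]) for n in range(n_iter)]
std_curve = [statistics.pstdev([p[n] for p in paths]) for n in range(n_iter)]
print(theta_true[0], mean_curve[-1], std_curve[9], std_curve[-1])
```

Plotting `mean_curve` with error bars taken from `std_curve` should give a picture of the same type as the one described in the example: the mean approaches the true value, and the spread shrinks as n grows.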

5.5 THE LEAST-MEAN-SQUARES ADAPTIVE ALGORITHM

The stochastic gradient descent algorithm in (5.34) converges to the optimal MSE solution provided
that $\mu_n$ satisfies the two convergence conditions. Once the algorithm has converged, it “locks” at the
obtained solution. In a case where the statistics of the involved variables/processes and/or the unknown
parameters start changing, the algorithm cannot track the changes. Note that if such changes occur, the
error term

$$e_n := y_n - \theta_{n-1}^T x_n$$

will attain larger values; however, because $\mu_n$ is very small, the increased value of the error will not
lead to corresponding changes of the estimate at time n. This can be overcome if one sets the value of
$\mu_n$ to a preselected fixed value, $\mu$. The resulting algorithm is the celebrated least-mean-squares (LMS)
algorithm [102].

Algorithm 5.1 (The LMS algorithm).

• Initialize

- $\theta_{-1} = 0 \in \mathbb{R}^l$; other values can also be used.

- Select the value of $\mu$.

• For $n = 0, 1, \ldots$, Do

- $e_n = y_n - \theta_{n-1}^T x_n$

- $\theta_n = \theta_{n-1} + \mu e_n x_n$

• End For

In case the input is a time series,^6 $u_n$, the initialization also involves the samples, $u_{-1}, \ldots, u_{-l+1} = 0$,
to form the input vectors, $u_n$, $n = 0, 1, \ldots, l - 2$. The complexity of the algorithm amounts to $2l$
multiplications/additions (MADs) per time update. We have assumed that observations start arriving at
time instant $n = 0$, to be in line with most references treating the LMS.
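
The two steps of Algorithm 5.1 translate almost literally into code. The following plain-Python sketch is meant only as an illustration: the function name `lms`, the noiseless three-coefficient test system, and the step size 0.05 are assumptions made for the example.

```python
import random

def lms(xs, ys, mu, theta_init=None):
    """Sketch of Algorithm 5.1; returns the final estimate and the errors e_n."""
    dim = len(xs[0])
    theta = list(theta_init) if theta_init is not None else [0.0] * dim
    errors = []
    for x, y in zip(xs, ys):
        e = y - sum(t * xi for t, xi in zip(theta, x))        # e_n = y_n - theta_{n-1}^T x_n
        theta = [t + mu * e * xi for t, xi in zip(theta, x)]  # theta_n = theta_{n-1} + mu e_n x_n
        errors.append(e)
    return theta, errors

# Hypothetical noiseless system used only to exercise the recursion.
rng = random.Random(0)
theta_true = [0.5, -1.0, 2.0]
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
ys = [sum(t * xi for t, xi in zip(theta_true, x)) for x in xs]
theta_hat, errors = lms(xs, ys, mu=0.05)
print(theta_hat)
```

Each update costs one inner product and one scaled vector addition, which is where the $2l$ MADs per time update come from.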

Let us now comment on this simple structure. Assume that the algorithm has converged close to 
the solution; then the error term is expected to take small values and thus the updates will remain close 
to the solution. If the statistics and/or the system parameters now start changing, the error values are 
expected to increase. Given that $\mu$ has a constant value, the algorithm now has the “agility” to update
the estimates in an attempt to “push” the error to lower values. This small variation of the iterative
scheme has important implications. The resulting algorithm is no longer a member of the Robbins-Monro
stochastic approximation family. Thus, one has to study its convergence conditions as well as
its performance properties. Moreover, since the algorithm now has the potential to track changes in the 
values of the underlying parameters, as well as the statistics of the involved processes/variables, one 
has to study its performance in nonstationary environments; this is associated to what is known as the 
tracking performance of the algorithm, and it will be treated at the end of the chapter. 
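
The difference between a decaying and a fixed step size can be seen in a small experiment. The sketch below is an illustration under assumed conditions that do not come from the text: a scalar, noiseless model whose parameter jumps from 1 to 2 at $n = 1000$, a decaying step $1/(n+1)$, and a fixed step 0.05; `track` is a hypothetical helper name.

```python
import random

def track(step_size, xs, thetas_true):
    """Scalar stochastic gradient recursion; step_size(n) gives the step used at time n."""
    theta = 0.0
    for n, (x, theta_o) in enumerate(zip(xs, thetas_true)):
        y = theta_o * x                              # noiseless observation
        theta += step_size(n) * (y - theta * x) * x  # same update form as (5.34)
    return theta

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
thetas_true = [1.0] * 1000 + [2.0] * 1000            # abrupt change at n = 1000

theta_decaying = track(lambda n: 1.0 / (n + 1), xs, thetas_true)
theta_fixed = track(lambda n: 0.05, xs, thetas_true)
print(theta_decaying, theta_fixed)
```

In this setting the fixed-step recursion ends up at the new value, whereas the decaying-step recursion is still far from it after 1000 further samples, since its steps have become too small to follow the change.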


6 Recall our adopted notation from Chapter 2; in this case we use $u_n$ in place of $x_n$.

5.5.1 CONVERGENCE AND STEADY-STATE PERFORMANCE OF THE LMS IN
STATIONARY ENVIRONMENTS


The goal of this subsection is to study the performance of the LMS in stationary environments, that is,
to answer the following questions. (a) Does the scheme converge and under which conditions? (b) If it
converges, where does it converge? Although we introduced the scheme having in mind nonstationary
environments, we still have to know how it behaves under stationarity; after all, the environment can
change very slowly, and it can be considered “locally” stationary.

The convergence properties of the LMS, as well as of any other online/adaptive algorithm, are
related to its transient characteristics; that is, the period from the initial estimate until the algorithm
reaches a “steady-state” mode of operation. In general, analyzing the transient performance of an online
algorithm is a formidable task indeed. This is true even for the very simple structure of the LMS
summarized in Algorithm 5.1. The LMS update recursions are equivalent to an estimator that is
time-varying, nonlinear (Problem 5.5), and stochastic in nature. Many papers, some of them of high scientific insight
and mathematical skill, have been produced. However, with the exception of a few rare and special 
cases, the analysis involves approximations. Our goal in this book is not to treat this topic in detail. 
Our focus will be restricted to the most “primitive” of the techniques, which are easier for the reader to 
follow compared with more advanced and mathematically elegant theories; after all, even this primitive 
approach provides results that turn out to be in agreement with what one experiences in practice. 

Convergence of the Parameter Error Vector 


Define

$$c_n := \theta_n - \theta_*,$$

where $\theta_*$ is the optimal solution resulting from the normal equations. The LMS update recursion can
now be written as

$$c_n = c_{n-1} + \mu x_n (y_n - \theta_{n-1}^T x_n).$$

Because we are going to study the statistical properties of the obtained estimates, we have to switch
our notation from that referring to observations to the one involving the respective random variables.
Then we can write

$$\begin{aligned}
c_n &= c_{n-1} + \mu x (y - \theta_{n-1}^T x + \theta_*^T x - \theta_*^T x) \\
&= c_{n-1} - \mu x x^T c_{n-1} + \mu x e_* \\
&= (I - \mu x x^T) c_{n-1} + \mu x e_*,
\end{aligned} \tag{5.35}$$

where

$$e_* = y - \theta_*^T x \tag{5.36}$$

is the error random variable associated with the optimal $\theta_*$. Compare (5.35) with (5.9). They look
similar, yet they are very different. First, the latter of the two involves the expected value, $\Sigma_x$, in place
of the respective variables. Moreover, in (5.35), there is a second term that acts as an external input to
the difference stochastic equation. From (5.35) we obtain

$$E[c_n] = E[(I - \mu x x^T) c_{n-1}] + \mu E[x e_*]. \tag{5.37}$$

To proceed, it is time to introduce assumptions. 

Assumption 1. The involved random variables are jointly linked via the regression model,

$$y = \theta_o^T x + \eta, \tag{5.38}$$

where $\eta$ is the noise variable with variance $\sigma_\eta^2$ and it is assumed to be independent of x. Moreover,
successive samples $\eta_n$, which generate the data, are assumed to be i.i.d. We have seen in Remarks 4.2
and Problem 4.4 that in this case, $\theta_* = \theta_o$ and $\sigma_{e_*}^2 = \sigma_\eta^2$. Also, due to the orthogonality condition,
$E[x e_*] = 0$. In addition, a stronger condition will be adopted, and $e_*$ and x will be assumed to be
statistically independent. This is justified by the fact that under the above model, $e_{*,n} = \eta_n$, and the
noise sequence has been assumed to be independent of the input.

Assumption 2. (Independence assumption) Assume that $c_{n-1}$ is statistically independent of both x
and $e_*$. No doubt this is a strong assumption, but one we will adopt to simplify computations. Sometimes
there is a tendency to “justify” this assumption by resorting to some special cases, which we will
not do. If one is not happy with the assumption, he/she has to look for more recent methods, based 
on more rigorous mathematical analysis; of course, this does not mean that such methods are free of 
assumptions. 

I. Convergence in the mean: Having adopted the previous assumptions, (5.37) becomes

$$E[c_n] = E[(I - \mu x x^T) c_{n-1}] = (I - \mu \Sigma_x) E[c_{n-1}]. \tag{5.39}$$

Following similar arguments as in Section 5.3, we obtain

$$E[v_n] = (I - \mu \Lambda) E[v_{n-1}],$$

where $\Sigma_x = Q \Lambda Q^T$ and $v_n = Q^T c_n$. The last equation leads to

$$E[\theta_n] \to \theta_* \quad \text{as } n \to \infty,$$

provided that

$$0 < \mu < \frac{2}{\lambda_{\max}}.$$

In other words, in a stationary environment, the LMS converges to the optimal MSE solution in the 
mean. Thus, by fixing the value of the step size to be a constant, we lose something; the obtained 
estimates, even after convergence, hover around the optimal solution. The obvious task to be considered 
next is to study the respective covariance matrix. 

II. Error vector covariance matrix: From (5.39), applying it recursively and assuming for the
initial condition that $E[c_{-1}] = 0$, we have $E[c_n] = 0$. In any case, borrowing the same arguments used
in establishing the convergence in the mean, it turns out that the latter is approximately true for large
enough values of n, irrespective of the initialization. Thus, from (5.35) we get

$$\begin{aligned}
\Sigma_{c,n} := E[c_n c_n^T] = {} & \Sigma_{c,n-1} - \mu E[x x^T c_{n-1} c_{n-1}^T] \\
& - \mu E[c_{n-1} c_{n-1}^T x x^T] + \mu^2 E[e_*^2 x x^T] \\
& + \mu^2 E[x x^T c_{n-1} c_{n-1}^T x x^T],
\end{aligned} \tag{5.40}$$

where we have used the independence between $e_*$ and $c_{n-1}$ and the fact that $e_*$ is orthogonal to x in
order to set to zero some of the terms. Taking into consideration the adopted independence assumptions
and assuming that the input vector follows a Gaussian distribution, (5.40) becomes

$$\begin{aligned}
\Sigma_{c,n} = {} & \Sigma_{c,n-1} - \mu \Sigma_x \Sigma_{c,n-1} - \mu \Sigma_{c,n-1} \Sigma_x \\
& + 2\mu^2 \Sigma_x \Sigma_{c,n-1} \Sigma_x + \mu^2 \Sigma_x \, \mathrm{trace}\{\Sigma_x \Sigma_{c,n-1}\} \\
& + \mu^2 \sigma_\eta^2 \Sigma_x,
\end{aligned} \tag{5.41}$$

where the Gaussian assumption has been exploited to express the term involving fourth-order moments
(e.g., [74]) as

$$E[x x^T \Sigma_{c,n-1} x x^T] = 2 \Sigma_x \Sigma_{c,n-1} \Sigma_x + \Sigma_x \, \mathrm{trace}\{\Sigma_x \Sigma_{c,n-1}\}.$$

Mobilizing the definition of $v_n = Q^T c_n$, (5.41) results in (Problem 5.6)

$$\begin{aligned}
\Sigma_{v,n} := Q^T \Sigma_{c,n} Q = {} & \Sigma_{v,n-1} - \mu \Lambda \Sigma_{v,n-1} - \mu \Sigma_{v,n-1} \Lambda \\
& + 2\mu^2 \Lambda \Sigma_{v,n-1} \Lambda + \mu^2 \Lambda \, \mathrm{trace}\{\Lambda \Sigma_{v,n-1}\} + \mu^2 \sigma_\eta^2 \Lambda.
\end{aligned} \tag{5.42}$$

Note that our interest lies in the diagonal elements of $\Sigma_{v,n}$, since these correspond to the variances
of the respective elements of $v_n$, that is, of the error vector $\theta_n - \theta_*$ expressed in the coordinate
system defined by Q. Collecting all the diagonal
elements in a vector, $s_n$, a close inspection of the diagonal elements of $\Sigma_{v,n}$ in (5.42) can persuade the
reader that the following difference equation is true:

$$s_n = (I - 2\mu\Lambda + 2\mu^2\Lambda^2 + \mu^2 \lambda\lambda^T) s_{n-1} + \mu^2 \sigma_\eta^2 \lambda, \tag{5.43}$$

where

$$\lambda := [\lambda_1, \lambda_2, \ldots, \lambda_l]^T.$$

It is well known from linear system theory that the difference equation in (5.43) is stable if the
eigenvalues of the matrix

$$A := I - 2\mu\Lambda + 2\mu^2\Lambda^2 + \mu^2\lambda\lambda^T = (I - \mu\Lambda)^2 + \mu^2\Lambda^2 + \mu^2\lambda\lambda^T \tag{5.44}$$

have magnitude less than one. This can be guaranteed if the step size $\mu$ is chosen such that
(Problem 5.7)

$$0 < \mu < \frac{2}{\sum_{i=1}^{l} \lambda_i},$$

or

$$\mu < \frac{2}{\mathrm{trace}\{\Sigma_x\}}. \tag{5.45}$$

The last condition guarantees that the variances remain bounded. Recall the number of assumptions
made. Thus, to be on the safe side, $\mu$ must be selected so that it is not close to this upper bound.

III. Excess mean-square error: We know that the minimum MSE is achieved at $\theta_*$. Any other
weight vector results in higher values of the MSE. We have already said that in the steady state, the
estimates obtained via the LMS fluctuate randomly around $\theta_*$; thus, the MSE will be larger than the
minimum $J_{\min}$. This “extra” error power, denoted as $J_{\text{exc}}$, is known as the excess MSE. Also, the ratio

$$\mathcal{M} := \frac{J_{\text{exc}}}{J_{\min}}$$

is known as the misadjustment. No doubt, we should seek the relationship of $\mathcal{M}$ with $\mu$, and in
practice we would like to adjust $\mu$ accordingly, in order to get a value of $\mathcal{M}$ as small as possible.
Unfortunately, we will soon see that there is a tradeoff in achieving that. Making $\mathcal{M}$ small, the
convergence speed becomes slower and vice versa; there is no free lunch!

By the respective definitions, we have

$$e_n = y_n - \theta_{n-1}^T x = e_{*,n} - c_{n-1}^T x,$$

or

$$e_n^2 = e_{*,n}^2 + c_{n-1}^T x x^T c_{n-1} - 2 e_{*,n} c_{n-1}^T x. \tag{5.46}$$

Taking the expectation on both sides and exploiting the assumed independence between $c_{n-1}$ and x
and $e_{*,n}$, as well as the orthogonality between $e_{*,n}$ and x, we get

$$\begin{aligned}
J_n := E[e_n^2] &= J_{\min} + E[c_{n-1}^T x x^T c_{n-1}] \\
&= J_{\min} + E[\mathrm{trace}\{c_{n-1}^T x x^T c_{n-1}\}] \\
&= J_{\min} + \mathrm{trace}\{\Sigma_x \Sigma_{c,n-1}\},
\end{aligned} \tag{5.47}$$

where the property $\mathrm{trace}\{AB\} = \mathrm{trace}\{BA\}$ has been used. Thus, we can finally write

$$J_{\text{exc},n} = \mathrm{trace}\{\Sigma_x \Sigma_{c,n-1}\}: \quad \text{excess MSE at time instant } n. \tag{5.48}$$

7 The time index n is explicitly used for $e$, $e_*$, $y$, since the formula to be derived is also valid for time-varying environments,
and it is going to be used later on for the time-varying statistics case.

Let us now elaborate on it a bit more. Taking into account that $Q Q^T = I$, we get

$$\begin{aligned}
J_{\text{exc},n} &= \mathrm{trace}\{Q Q^T \Sigma_x Q Q^T \Sigma_{c,n-1} Q Q^T\} \\
&= \mathrm{trace}\{Q \Lambda \Sigma_{v,n-1} Q^T\} = \mathrm{trace}\{\Lambda \Sigma_{v,n-1}\} \\
&= \sum_{i=1}^{l} \lambda_i [s_{n-1}]_i = \lambda^T s_{n-1},
\end{aligned} \tag{5.49}$$

where $s_n$ is the vector of the diagonal elements of $\Sigma_{v,n}$ and obeys the difference equation in (5.43).
Assuming that $\mu$ has been chosen so that convergence is guaranteed, then for large values of n the
steady state has been reached. In a more formal way, an online algorithm has reached the steady state if

$$E[\theta_n] = E[\theta_{n-1}] = \text{Constant}, \tag{5.50}$$

$$\Sigma_{c,n} = \Sigma_{c,n-1} = \text{Constant}. \tag{5.51}$$

Thus, in steady state, we assume in Eq. (5.43) that $s_n = s_{n-1}$; if this is exploited in (5.49) this leads to
(Problem 5.10)

$$J_{\text{exc},\infty} \simeq \frac{\mu \sigma_\eta^2 \, \mathrm{trace}\{\Sigma_x\}}{2 - \mu \, \mathrm{trace}\{\Sigma_x\}}, \tag{5.52}$$

and for the misadjustment (since under our assumptions $J_{\min} = \sigma_\eta^2$),

$$\mathcal{M} \simeq \frac{\mu \, \mathrm{trace}\{\Sigma_x\}}{2 - \mu \, \mathrm{trace}\{\Sigma_x\}},$$

which, for small values of $\mu$, leads to

$$J_{\text{exc},\infty} \simeq \frac{1}{2} \mu \sigma_\eta^2 \, \mathrm{trace}\{\Sigma_x\}: \quad \text{excess MSE} \tag{5.53}$$

and

$$\mathcal{M} \simeq \frac{1}{2} \mu \, \mathrm{trace}\{\Sigma_x\}: \quad \text{misadjustment}. \tag{5.54}$$

That is, the smaller the value of $\mu$, the smaller the excess MSE.

IV. Time constant. Note that the transient behavior of the LMS is described by the difference equation
in (5.43) and its convergence rate (the speed with which it forgets the initial conditions until it
settles to its steady-state operation) depends on the eigenvalues of A in (5.44). To simplify the formulas,
assume $\mu$ to be small enough so that A is approximated as $(I - \mu\Lambda)^2$. Following similar arguments
as those used for the gradient descent in Section 5.3, we can write

$$\tau_j^{\text{LMS}} \simeq \frac{1}{2\mu\lambda_j}.$$

That is, the time constant for each one of the modes is inversely proportional to $\mu$. Hence, the slower
the rate of convergence (small values of $\mu$), the lower the misadjustment and vice versa. Viewing it
differently, the more time the algorithm spends on learning, prior to reaching the steady state, the
smaller is its deviation from the optimal.
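
The tradeoff can be made concrete by evaluating the approximate steady-state expressions of this subsection for two step sizes. The sketch below simply codes the approximations $J_{\text{exc}} \simeq \mu\sigma_\eta^2\,\mathrm{trace}\{\Sigma_x\}/(2 - \mu\,\mathrm{trace}\{\Sigma_x\})$ and $\tau_j \simeq 1/(2\mu\lambda_j)$; the function name `lms_steady_state`, the eigenvalues, and the noise variance are hypothetical values chosen for illustration.

```python
def lms_steady_state(mu, eigenvalues, noise_var):
    """Approximate excess MSE, misadjustment, and per-mode time constants of the LMS."""
    trace = sum(eigenvalues)                      # trace of the input covariance matrix
    j_exc = mu * noise_var * trace / (2.0 - mu * trace)
    misadjustment = j_exc / noise_var             # J_min equals the noise variance here
    time_constants = [1.0 / (2.0 * mu * lam) for lam in eigenvalues]
    return j_exc, misadjustment, time_constants

# Hypothetical input statistics: eigenvalues 1.0 and 0.5, noise variance 0.1.
for mu in (0.1, 0.01):
    print(mu, lms_steady_state(mu, [1.0, 0.5], noise_var=0.1))
```

Reducing $\mu$ by a factor of ten lowers the misadjustment by roughly the same factor and lengthens every time constant tenfold.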


5.5.2 CUMULATIVE LOSS BOUNDS 


In the previously reported analysis method for the LMS performance, there is an underlying assumption 
that the training samples are generated by a linear model. The focus of the analysis was to investigate 
how well our algorithm estimates the unknown model once steady state has been reached. This path of 
analysis is very popular and well suited for a number of tasks, such as system identification.

An alternative route for studying the performance of an algorithm, shedding light from a different 
angle, is via the so-called cumulative loss. Recall that the main goal in machine learning is prediction;
hence, measuring the prediction accuracy of an algorithm, given a set of observations, becomes the 
main goal. However, this performance index should be measured against the generalization ability of 
the algorithm, as pointed out in Chapter 3. In practice, this can be done in different ways, such as via 
the leave-one-out method (Section 3.13). 

For the case of the squared error loss function, the cumulative loss over N observation samples is
defined as

$$\mathcal{L}_{\text{cum}} = \sum_{n=0}^{N-1} (y_n - \hat{y}_n)^2 = \sum_{n=0}^{N-1} (y_n - \theta_{n-1}^T x_n)^2: \quad \text{cumulative loss}. \tag{5.55}$$

Note that $\theta_{n-1}$ has been estimated based on observations up to and including the time instant $n - 1$.
So, the training pair $(y_n, x_n)$ can be considered as a test sample, not involved in the training, for
measuring the error. It must be pointed out, however, that the cumulative loss is not a direct measure of
the generalization performance associated with the finally obtained parameter vector. Such a measure
should involve $\theta_{N-1}$, tested against a number of test samples.
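
For an algorithm such as the LMS, the cumulative loss is accumulated on the fly: each sample is first predicted with the estimate available before it arrives and only then used for the update. The sketch below is illustrative; the function name `lms_cumulative_loss` and the tiny one-dimensional data set are assumptions chosen so that the numbers can be followed by hand.

```python
def lms_cumulative_loss(xs, ys, mu):
    """Cumulative squared prediction error of LMS predictions, starting from theta = 0."""
    theta = [0.0] * len(xs[0])
    loss = 0.0
    for x, y in zip(xs, ys):
        y_hat = sum(t * xi for t, xi in zip(theta, x))  # prediction made with theta_{n-1}
        loss += (y - y_hat) ** 2
        theta = [t + mu * (y - y_hat) * xi for t, xi in zip(theta, x)]
    return loss, theta

# Hand-checkable case: x_n = 1 and y_n = 2 for n = 0, 1, 2, with mu = 0.5.
loss, theta = lms_cumulative_loss([[1.0]] * 3, [2.0] * 3, mu=0.5)
print(loss, theta)
```

Here the predictions are 0, 1, and 1.5, so the three squared errors are 4, 1, and 0.25, and the final estimate is 1.75.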

The goal of the family of methods which build around the cumulative loss is to derive corresponding 
upper bounds. Our aim here is to outline the essence behind such approaches, without resorting to 
proofs; these comprise a series of bounds and can become a bit technical. The interested reader can
consult the related references. We will return to the cumulative loss and related bounds in the context
of the so-called regret analysis in Chapter 8.

In the context of the LMS, the following theorem has been proved in [21].

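Theorem 5.1. Let $C = \max_n \|x_n\|$, $\mu = \beta / C^2$, $0 < \beta < 2$. Then the set of predictions, $\hat{y}_0, \ldots, \hat{y}_{N-1}$,
generated by Algorithm 5.1 satisfies the following bound:

$$\sum_{n=0}^{N-1} (y_n - \hat{y}_n)^2 \le \inf_{\theta} \left\{ \frac{C^2 \|\theta\|^2}{\beta(2 - \beta)c} + \frac{\mathcal{L}(\theta, S)}{(2 - \beta)^2 c(1 - c)} \right\}, \tag{5.56}$$

where $0 < c < 1$ and

$$\mathcal{L}(\theta, S) = \sum_{n=0}^{N-1} (y_n - \theta^T x_n)^2, \tag{5.57}$$

where $S = \{(y_n, x_n),\ n = 0, 1, \ldots, N-1\}$.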

One can then tune $\beta$ optimally to minimize the upper bound. Note that this is a worst-case scenario.
The tuning is achieved by restricting the set of linear functions, so that $\|\theta\| \le \Theta$. Let

$$L_\Theta(S) = \min_{\|\theta\| \le \Theta} \mathcal{L}(\theta, S), \tag{5.58}$$

and also assume that there is an upper bound L, such that

$$|L_\Theta(S)| \le L. \tag{5.59}$$

Then it can be shown [21] that

$$\sum_{n=0}^{N-1} (y_n - \hat{y}_n)^2 \le L_\Theta(S) + 2\Theta C \sqrt{L} + (\Theta C)^2. \tag{5.60}$$

Note that the previous analysis has been carried out without invoking any probabilistic arguments.
Alternative bounds are derived by mobilizing different assumptions [21]. Bounds of this kind, involving
similar assumptions, are frequently encountered for the analysis of various algorithms. We will meet
such examples later in this book.

Remarks 5.2. 

• The analysis method presented in this section, concerning the transient and steady-state performance
of the LMS, can be considered as the most primitive and goes back to the early work of
Widrow and Hoff [102]. Another popular path for analyzing stochastic algorithms in an averaging
sense is the so-called averaging method, which operates under the assumption of small values of
the step size $\mu$ [56]. Another approach that builds around the assumption of small step sizes, $\mu \simeq 0$,
is the so-called ordinary-differential-equation approach (ODE) [57]. The difference update equation
is “transformed” to a differential equation, which paves the way for using arguments from the
Lyapunov stability theory. An alternative elegant theoretical tool, which can be used as a vehicle
for analyzing the transient, the steady state, and the tracking performance of adaptive schemes is
the so-called energy conservation method, developed in [82] and later on extended in [3,105]. More
on the performance analysis of the LMS as well as of other online schemes can be found in more
specialized books and papers [4,13,47,62,83,91,92,96,103].

• $H^\infty$ optimality of the LMS. It may come as a surprise that the LMS algorithm, a very simple structure,
has survived the time and is one of the most popular and widely used schemes in practical
applications. The reason is that, besides its low complexity, it enjoys the luxury of robustness. An
alternative optimization flavor of the LMS algorithm has been given via the theory of $H^\infty$ optimization
for estimation.

Assume that our data obey the regression model of (5.38), where now no assumption is made on
the nature of $\eta$. Given the sequence of output observation samples, $y_0, y_1, \ldots, y_{N-1}$, the goal is to
obtain estimates, $\hat{s}_{n|n-1}$, of $s_n$, generated as $s_n = \theta^T x_n$, based on the training set up to and including
time $n - 1$ (causality), such that

$$\frac{\sum_{n=0}^{N-1} |\hat{s}_{n|n-1} - s_n|^2}{\mu^{-1} \|\theta\|^2 + \sum_{n=0}^{N-1} |\eta_n|^2} < \gamma^2. \tag{5.61}$$

The numerator is the total squared estimation error. The denominator involves two terms. One is
the noise/disturbance energy, and the other is the norm of the unknown parameter vector. Assuming
that one starts iterations from $\theta_{-1} = 0$, this term measures the energy of the disturbance from the
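initial guess. It turns out that the LMS scheme is the one that minimizes the following cost:

$$\sup_{\theta, \{\eta_n\}} \frac{\sum_{n=0}^{N-1} |\hat{s}_{n|n-1} - s_n|^2}{\mu^{-1} \|\theta\|^2 + \sum_{n=0}^{N-1} |\eta_n|^2}.$$

Moreover, it turns out that the optimum corresponds to $\gamma = 1$ [46,83]. Note that, basically, the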
LMS optimizes a worst-case scenario. It makes the estimation error minimum under the worst (maximum)
disturbance circumstances. This type of optimality explains the robust performance of the
LMS under “nonideal” environments, which are often met in practice, where a number of modeling
assumptions are not valid. Such deviations from the model can be accommodated in $\eta_n$, which then
“loses” its i.i.d., white, Gaussian, or any other mathematically attractive property.

Finally, it is interesting to point out the similarity of the bound in (5.61) to that in (5.56). Indeed, in 
the former we make the following substitutions: 



Ignoring the values of the constants, the involved quantities are the same. After all, $H^\infty$ is about
maximizing the worst-case scenario [55].


5.6 THE AFFINE PROJECTION ALGORITHM 


As will soon be verified in the simulations section, a major drawback of the basic LMS scheme is 
its fairly slow convergence speed. In an attempt to improve upon it, a number of variants have been 
proposed over the years. The affine projection algorithm (APA) belongs to the so-called data reusing 
family, where, at each time instant, past data are reused. Such a rationale helps the algorithm to “learn 
faster” and improve the convergence speed. However, besides the increased complexity, the faster
convergence speed is achieved at the expense of an increased misadjustment level.

The APA was proposed originally in [48] and later on in [72]. Let the currently available estimate
be $\theta_{n-1}$. According to APA, the updated estimate, $\theta$, must satisfy the following constraints:

$$x_{n-i}^T \theta = y_{n-i}, \quad i = 0, 1, \ldots, q - 1.$$

In other words, we force the parameter vector $\theta$ to provide at its output the desired response samples,
for the q most recent time instants, where q is a user-defined parameter. At the same time, APA requires
$\theta$ to be as close as possible, in the Euclidean norm sense, to the current estimate, $\theta_{n-1}$. Thus, APA, at
each time instant, solves the following constrained optimization task:

$$\begin{aligned}
\theta_n = \arg\min_{\theta}\ & \|\theta - \theta_{n-1}\|^2 \\
\text{s.t.}\ & x_{n-i}^T \theta = y_{n-i}, \quad i = 0, 1, \ldots, q - 1.
\end{aligned} \tag{5.62}$$

If one defines the $q \times l$ matrix

$$X_n = \begin{bmatrix} x_n^T \\ \vdots \\ x_{n-q+1}^T \end{bmatrix},$$

then the set of constraints can be compactly written as

$$X_n \theta = \mathbf{y}_n,$$

where

$$\mathbf{y}_n = [y_n, \ldots, y_{n-q+1}]^T.$$

Using Lagrange multipliers in (5.62) (Appendix C) results in (Problem 5.11)

$$\theta_n = \theta_{n-1} + X_n^T (X_n X_n^T)^{-1} \mathbf{e}_n, \tag{5.63}$$

$$\mathbf{e}_n = \mathbf{y}_n - X_n \theta_{n-1}, \tag{5.64}$$

provided that $X_n X_n^T$ is invertible. The resulting scheme is summarized in Algorithm 5.2.

Algorithm 5.2 (The affine projection algorithm).

• Initialize

- $x_{-1} = \ldots = x_{-q+1} = 0$, $y_{-1} = \ldots = y_{-q+1} = 0$

- $\theta_{-1} = 0 \in \mathbb{R}^l$ (or any other value).

- Choose $0 < \mu < 2$ and $\delta$ to be small.

• For $n = 0, 1, \ldots$, Do

- $\mathbf{e}_n = \mathbf{y}_n - X_n \theta_{n-1}$

- $\theta_n = \theta_{n-1} + \mu X_n^T (\delta I + X_n X_n^T)^{-1} \mathbf{e}_n$

• End For

When the input is a time series, the corresponding input vector, denoted as $u_n$, is initialized by
setting to zero all required samples with negative time index, $u_{-1}, u_{-2}, \ldots$. Note that in the algorithm,
a parameter, $\delta$, of a small value has also been used to prevent numerical problems in the associated
matrix inversion. Also, a step size $\mu$ has been introduced, which controls the size of the update and
whose presence will be justified soon. The complexity of APA is increased compared with that of
the LMS, due to the involved matrix inversion and matrix operations, requiring $O(q^3)$ MADs. Fast
versions of the APA, which exploit the special structure of $X_n X_n^T$ for the case where the involved
input-output variables are realizations of stochastic processes, have also been developed (see [42,43]).

The convergence analysis of the APA is more involved than that of the LMS. It turns out that
provided that $0 < \mu < 2$, stability of the algorithm is guaranteed. The misadjustment is approximately
given by [2,34,83]

$$\mathcal{M} \simeq \frac{\mu q}{2 - \mu} E\left[\frac{1}{\|x_n\|^2}\right] \mathrm{trace}\{\Sigma_x\}: \quad \text{misadjustment for the APA}.$$

In words, the misadjustment increases as the parameter q increases; that is, as the number of the reused
past data samples increases.
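
The two steps of Algorithm 5.2 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper `solve` (Gaussian elimination, used instead of an explicit matrix inverse), the function name `apa`, the noiseless four-coefficient test system, and the values $q = 2$, $\mu = 1$, $\delta = 10^{-8}$ are all choices made for the example.

```python
import random

def solve(a, b):
    """Solve a z = b for a small square matrix a by Gaussian elimination with pivoting."""
    q = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]       # augmented matrix
    for col in range(q):
        piv = max(range(col, q), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, q):
            f = m[r][col] / m[col][col]
            for c in range(col, q + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * q
    for r in range(q - 1, -1, -1):                     # back substitution
        z[r] = (m[r][q] - sum(m[r][c] * z[c] for c in range(r + 1, q))) / m[r][r]
    return z

def apa(xs, ys, q, mu=1.0, delta=1e-8):
    """Sketch of Algorithm 5.2 with zero initialization of past inputs and outputs."""
    dim = len(xs[0])
    theta = [0.0] * dim
    rows = [[0.0] * dim for _ in range(q)]             # rows of X_n: x_n, ..., x_{n-q+1}
    targets = [0.0] * q                                # y_n, ..., y_{n-q+1}
    for x, y in zip(xs, ys):
        rows = [list(x)] + rows[:-1]
        targets = [y] + targets[:-1]
        e = [t - sum(ri * th for ri, th in zip(r, theta)) for r, t in zip(rows, targets)]
        gram = [[sum(u * v for u, v in zip(ri, rj)) + (delta if i == j else 0.0)
                 for j, rj in enumerate(rows)] for i, ri in enumerate(rows)]
        z = solve(gram, e)                             # (delta I + X_n X_n^T)^{-1} e_n
        theta = [th + mu * sum(rows[i][k] * z[i] for i in range(q))
                 for k, th in enumerate(theta)]
    return theta

# Hypothetical noiseless system used only to exercise the recursion.
rng = random.Random(0)
theta_true = [1.0, -2.0, 0.5, 3.0]
xs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(300)]
ys = [sum(t * xi for t, xi in zip(theta_true, x)) for x in xs]
theta_hat = apa(xs, ys, q=2)
print(theta_hat)
```

The regularizer $\delta$ matters already at $n = 0$, where the zero-initialized past input makes $X_0 X_0^T$ singular.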


GEOMETRIC INTERPRETATION OF APA 

Let us look at the optimization task in (5.62), associated with APA. Each one of the q constraints defines
a hyperplane in the l-dimensional space. Hence, since $\theta_n$ is constrained to lie on all these hyperplanes,
it will lie on their intersection. Provided that $x_{n-i}$, $i = 0, \ldots, q - 1$, are linearly independent, these
hyperplanes share a nonempty intersection, which is an affine set of dimension $l - q$. An affine set is
the translation of a linear subspace (i.e., a plane crossing the origin) by a constant vector; that is, it
defines a plane in a general position. Thus, $\theta_n$ can lie anywhere in this affine set. From the infinite
number of points lying in this set, APA selects the one that lies closest, in the Euclidean distance sense,
to $\theta_{n-1}$. In other words, $\theta_n$ is the projection of $\theta_{n-1}$ on the affine set defined by the intersection of the
q hyperplanes. Recall from geometry that the projection $P_H(a)$ of a point a on a linear subspace/affine
set H is the point in H whose distance from a is minimum. Fig. 5.13 illustrates the geometry for the
case of $q = 2$; this special case of APA is also known as the binormalized data reusing LMS [7].

In the ideal noiseless case, the unknown parameter vector would lie in the intersection of all the
hyperplanes defined by $(y_n, x_n)$, $n = 0, 1, \ldots, q - 1$, and this is the information that APA tries to
exploit to speed up convergence. However, this is also its drawback. In any practical system, noise is
present; thus forcing the updates to lie in the intersection of these hyperplanes is not necessarily good,
since their position in space is also determined by the noise. As a matter of fact, the reason that $\mu$ is
introduced is to account for such cases. An alternative technique, which exploits projections, and at the
same time replaces hyperplanes by hyperslabs (whose width depends on the noise variance) to account
for the noise, will be treated in Chapter 8. In addition, there, no matrix inversion will be required.


ORTHOGONAL PROJECTIONS

Projections and projection matrices/operators play a crucial part in machine learning, signal processing,
and optimization in general; after all, a projection corresponds to a minimization task when the loss
is interpreted as a “distance.” Let A be an $l \times k$, $k < l$, matrix with column vectors, $a_i$, $i = 1, \ldots, k$,
and x an l-dimensional vector. The orthogonal projection of x on the subspace spanned by the columns
of A (assumed to be linearly independent) is given by (Appendix A)

$$P_{\{a_i\}}(x) = A (A^T A)^{-1} A^T x, \tag{5.65}$$

where in complex spaces the transpose operation is replaced by the Hermitian one. One can easily
check that $P_{\{a_i\}}^{\perp}(x) := (I - P_{\{a_i\}}) x$ is orthogonal to $P_{\{a_i\}}(x)$ and

$$x = P_{\{a_i\}}(x) + P_{\{a_i\}}^{\perp}(x).$$
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FIGURE 5.13 

The geometry associated with the APA, for q = 2 and 1 = 3. The intersection of the two hyperplanes is a straight 
line (affine set of dimension 3 — 2=1); 9 n is the projection of 0„-i on this line (point 1 ) for p, = 1 and <5 = 0. Point 
2 corresponds to the case /z < 1. Point 3 is the projection of 0„ on the hyperplane defined by (y„,x n ). This is the 
case for q = 1. The latter case corresponds to the normalized LMS (NLMS) of Section 5.6.1. 


When $A$ has orthonormal columns, we obtain the (familiar from geometry) expansion

$$P_{\{a_i\}}(x) = AA^T x = \sum_{i=1}^{k} (a_i^T x)\, a_i.$$

Thus, the factor $(A^T A)^{-1}$, for the general case, accounts for the lack of orthonormality of the columns
of $A$. The matrix $A(A^T A)^{-1} A^T$ is known as the respective projection matrix and $I - A(A^T A)^{-1} A^T$
as the projection matrix on the respective orthogonal complement space.
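The properties just stated are easy to check numerically. The following is a minimal sketch in Python with NumPy for the real-valued case; the function name `projection_matrix` and the random test data are illustrative assumptions, not part of the text.

```python
import numpy as np

def projection_matrix(A):
    """Return P = A (A^T A)^{-1} A^T for a real matrix A with independent columns."""
    A = np.asarray(A, dtype=float)
    # Solve (A^T A) Z = A^T rather than forming the inverse explicitly.
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))      # l = 5, k = 2 (hypothetical sizes)
x = rng.standard_normal(5)

P = projection_matrix(A)
x_par = P @ x                        # component in the span of the columns of A
x_perp = (np.eye(5) - P) @ x         # component in the orthogonal complement

# With orthonormal columns the factor (A^T A)^{-1} reduces to the identity.
Q, _ = np.linalg.qr(A)
P_orthonormal = Q @ Q.T
```

Here `x_par + x_perp` recovers `x`, the two components are orthogonal, and `P_orthonormal` coincides with `P` because `Q` spans the same subspace as `A`.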

The simplest case occurs when $k = 1$; then the projection of $x$ onto $a_1$ is equal to

$$P_{\{a_1\}}(x) = \frac{a_1 a_1^T}{\|a_1\|^2}\, x,$$

and the corresponding projection matrices are given by

$$P_{\{a_1\}} = \frac{a_1 a_1^T}{\|a_1\|^2}, \qquad P^{\perp}_{\{a_1\}} = I - \frac{a_1 a_1^T}{\|a_1\|^2}.$$

Fig. 5.14 illustrates the geometry.

FIGURE 5.14

Geometry indicating the orthogonal projection operation on the subspace spanned by $a_1$, $a_2$, using the projection
matrix. Note that $A = [a_1, a_2]$.

Let us apply the previously reported linear algebra results in the case of the APA of (5.63) and
(5.64), and rewrite them as

$$\theta_n = \left(I - X_n^T (X_n X_n^T)^{-1} X_n\right)\theta_{n-1} + X_n^T (X_n X_n^T)^{-1} y_n.$$

The first term on the right-hand side is the projection $P^{\perp}_{\{x_n, \ldots, x_{n-q+1}\}}(\theta_{n-1})$. This is most natural.

By the definition of the respective affine set, as the intersection of the hyperplanes

$$x_{n-i}^T \theta - y_{n-i} = 0, \quad i = 0, \ldots, q-1,$$

each vector $x_{n-i}$ is orthogonal to the respective hyperplane (Fig. 5.13) (Problem 5.12). Hence, projecting
$\theta_{n-1}$ on the intersection of all these hyperplanes is equivalent to projecting on an affine set, which
is orthogonal to all $x_n, \ldots, x_{n-q+1}$. Note that the matrix $X_n^T (X_n X_n^T)^{-1} X_n$ is the projection matrix
that projects on the subspace spanned by $x_n, \ldots, x_{n-q+1}$. The second term accounts for the fact that
the affine set on which we project does not include the origin, but it is translated to another point in
the space, whose direction is determined by $y_n$. Fig. 5.15 illustrates the case for $l = 2$ and $q = 1$. Because
$\theta_{n-1}$ does not lie on the line (plane) $x_n^T \theta - y_n = 0$, whose direction is defined by $x_n$, we know
from geometry (and it is easily checked) that its distance from this line is
$s = \frac{|x_n^T \theta_{n-1} - y_n|}{\|x_n\|}$. Also, $\theta_{n-1}$ lies on the negative side of the straight line, so that
$x_n^T \theta_{n-1} - y_n < 0$. Hence, taking into account the directions of the involved vectors, it turns out that

$$\Delta\theta = \frac{y_n - x_n^T \theta_{n-1}}{\|x_n\|}\, \frac{x_n}{\|x_n\|}.$$

Thus, for this specific case, the correction

$$\theta_n = \theta_{n-1} + \Delta\theta$$

coincides with the recursion of the APA.

FIGURE 5.15

$\Delta\theta$ is equal to the (signed) distance of $\theta_{n-1}$ from the plane times the unit vector $\frac{x_n}{\|x_n\|}$; $x_n$ should be
at the origin, but it is drawn so that it is clearly shown that it is perpendicular to the line segment.

5.6.1 THE NORMALIZED LMS 

The NLMS is a special case of the APA corresponding to $q = 1$ (see Fig. 5.13). We treat it separately
due to its popularity, and it is summarized in Algorithm 5.3.

Algorithm 5.3 (The normalized LMS).

• Initialization

- $\theta_{-1} = 0 \in \mathbb{R}^l$, or any other value.

- Choose $0 < \mu < 2$, and $\delta$ a small value.

• For $n = 0, 1, 2, \ldots$, Do

- $e_n = y_n - \theta_{n-1}^T x_n$

- $\theta_n = \theta_{n-1} + \frac{\mu}{\delta + x_n^T x_n}\, x_n e_n$

• End For

The complexity of the NLMS is $3l$ MADs. Stability of the NLMS is guaranteed if

$$0 < \mu < 2.$$

One can look at the NLMS as an LMS whose step size is left to vary with the iterations, as represented
by

$$\mu_n = \frac{\mu}{\delta + x_n^T x_n},$$

which turns out to have a beneficial effect on the convergence speed, compared with the LMS. More
on the performance analysis of the NLMS can be found in, e.g., [15,83,90].
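Algorithm 5.3 translates almost line by line into code. The sketch below is one possible real-valued implementation in Python with NumPy; the function name `nlms`, the data sizes, and the parameter values in the usage part are assumptions chosen only for illustration.

```python
import numpy as np

def nlms(X, y, mu=0.5, delta=1e-3):
    """Real-valued NLMS following Algorithm 5.3.

    X: (N, l) array whose rows are the input vectors x_n.
    y: (N,) array of observations y_n.
    Returns the final estimate and the a priori errors e_n.
    """
    N, l = X.shape
    theta = np.zeros(l)                 # theta_{-1} = 0
    errors = np.empty(N)
    for n in range(N):
        x = X[n]
        e = y[n] - theta @ x            # e_n = y_n - theta_{n-1}^T x_n
        # Step size mu_n = mu / (delta + x_n^T x_n) varies with the input energy.
        theta = theta + (mu / (delta + x @ x)) * x * e
        errors[n] = e
    return theta, errors

# Hypothetical regression task: 10 unknown parameters, white input, noise variance 0.01.
rng = np.random.default_rng(1)
theta_o = rng.standard_normal(10)
X = rng.standard_normal((2000, 10))
y = X @ theta_o + 0.1 * rng.standard_normal(2000)
theta_hat, errors = nlms(X, y, mu=0.5, delta=1e-3)
```

Each iteration costs one inner product for the error, one for the normalization, and one scaled vector addition, which is in line with the $3l$ operation count quoted above.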

Remarks 5.3. 

• To deal with sparse models, the so-called proportionate NLMS and related versions of the other
algorithms have been derived [11,35]. The idea is to use a separate step size for each one of the
parameters. This gives the freedom to the coefficients which correspond to small (or zero) values
of the model to adapt at a different rate than the rest, and this has a significant effect on the
convergence performance of the algorithm. Such schemes can be considered as the ancestors of the more
theoretically elegant sparsity promoting online algorithms, to be treated in Chapter 10.

• Another trend that has received attention more recently is to appropriately (convexly) combine the 
outputs of two (or more) learning structures. This has the effect of decreasing the sensitivity of 
the learning algorithms to choices of parameters such as the step size, or to the dimensionality of 
the problem (size of the filter). The two (or more) algorithms run independently and the mixing 
parameters of the outputs are learned during the training. In general, this approach turns out to be 
more robust in the choice of the involved user-defined parameters [8,81]. 

• In some papers, the user-defined parameter, $\delta$, in the NLMS and the APA is suggested to be given
a very small positive value. Often, the explanation for this option is that the use of $\delta$ is to avoid
division by zero. However, in practice, there are cases, such as the echo cancelation task, where $\delta$
needs to be set to quite large values (even larger than 1) to attain good performance. It is fairly recent
that the importance of $\delta$ was emphasized and a formula for its proper tuning was proposed both for
the case of NLMS and APA and their proportionate counterparts [12,73]. There, it is indicated that
without the proper setup of this parameter, the performance of these algorithms may be significantly
affected, and they may not even converge.

5.7 THE COMPLEX-VALUED CASE

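In Section 4.4, when

$$y_n \in \mathbb{C} \quad \text{and} \quad x_n \in \mathbb{C}^l,$$

the widely linear formulation of the estimation task was introduced to deal with complex-valued data
that do not obey the circularity conditions. The output of a widely linear estimator is given by

$$\hat{y}_n = \varphi^H \tilde{x}_n,$$

where

$$\varphi := \begin{bmatrix} \theta \\ v \end{bmatrix} \quad \text{and} \quad \tilde{x}_n := \begin{bmatrix} x_n \\ x_n^* \end{bmatrix},$$

with $\theta, v \in \mathbb{C}^l$. The MSE cost function is

$$J(\varphi) = \mathbb{E}\left[\left|y_n - \varphi^H \tilde{x}_n\right|^2\right],$$

and following standard arguments analogous to Section 5.3.1, the minimum with respect to $\varphi$ is given
by the root of

$$\Sigma_{\tilde{x}}\varphi - \mathbb{E}[\tilde{x}_n y_n^*] = 0 \quad \text{or} \quad \mathbb{E}[\tilde{x}_n \tilde{x}_n^H]\,\varphi = \mathbb{E}[\tilde{x}_n y_n^*].$$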


THE WIDELY LINEAR LMS 

Employing the Robbins-Monro scheme and fixing the value of $\mu_n$ to be a constant, we obtain

$$\varphi_n = \varphi_{n-1} + \mu \tilde{x}_n e_n^*,$$

$$e_n = y_n - \varphi_{n-1}^H \tilde{x}_n.$$

Breaking $\varphi_n$ and $\tilde{x}_n$ into their components, the widely linear LMS results.

Algorithm 5.4 (The widely linear LMS).

• Initialize

- $\theta_{-1} = 0$, $v_{-1} = 0$

- Choose $\mu$.

• For $n = 0, 1, \ldots$, Do

- $e_n = y_n - \theta_{n-1}^H x_n - v_{n-1}^H x_n^*$

- $\theta_n = \theta_{n-1} + \mu x_n e_n^*$

- $v_n = v_{n-1} + \mu x_n^* e_n^*$

• End For
Note that whatever has been said for the stability conditions concerning the LMS is applied here as
well, if $\Sigma_x$ is replaced by $\Sigma_{\tilde{x}}$. For circularly symmetric variables, we set $v_n = 0$ and the complex linear
LMS results.

THE WIDELY LINEAR APA

Let $\varphi_n$ and $\tilde{x}_n$ be defined as before. The widely linear APA results; it is given in Algorithm 5.5
(Problem 5.13).

Algorithm 5.5 (The widely linear APA).

• Initialize

- $\varphi_{-1} = 0$

- Choose $\mu$.

• For $n = 0, 1, 2, \ldots$, Do

- $e_n^* = y_n^* - \tilde{X}_n \varphi_{n-1}$

- $\varphi_n = \varphi_{n-1} + \mu \tilde{X}_n^H (\delta I + \tilde{X}_n \tilde{X}_n^H)^{-1} e_n^*$

• End For

Note that

$$\tilde{X}_n = \begin{bmatrix} \tilde{x}_n^H \\ \vdots \\ \tilde{x}_{n-q+1}^H \end{bmatrix} = \begin{bmatrix} x_n^H & x_n^T \\ \vdots & \vdots \\ x_{n-q+1}^H & x_{n-q+1}^T \end{bmatrix},$$

and $\varphi_n \in \mathbb{C}^{2l}$. For circular variables/processes, the complex linear APA results by setting

$$\tilde{X}_n = X_n = \begin{bmatrix} x_n^H \\ \vdots \\ x_{n-q+1}^H \end{bmatrix}$$

and

$$\varphi_n = \theta_n \in \mathbb{C}^l.$$

5.8 RELATIVES OF THE LMS

In addition to the three basic stochastic gradient descent schemes that were previously reviewed, a 
number of variants have been proposed over the years, in an effort to either improve performance or 
reduce complexity. Some notable examples are described below. 

THE SIGN-ERROR LMS 

The update recursion for this algorithm becomes (e.g., [17,30,63])

$$\theta_n = \theta_{n-1} + \mu\, \text{csgn}[e_n^*]\, x_n,$$

where the complex sign of a complex number, $z = x + jy$, is defined as

$$\text{csgn}(z) = \text{sgn}(x) + j\, \text{sgn}(y).$$

If in addition $\mu$ is chosen to be a power of two, then the recursion becomes multiplication-free, and $l$
multiplications are only needed for the computation of the error. It turns out that the algorithm
minimizes, in the stochastic approximation sense, the following cost function:

$$J(\theta) = \mathbb{E}\left[\left|y - \theta^H x\right|\right],$$

and stability is guaranteed for sufficiently small values of $\mu$ [83].


THE LEAST-MEAN-FOURTH (LMF) ALGORITHM 

The scheme minimizes the following cost function:

$$J(\theta) = \mathbb{E}\left[\left|y - \theta^H x\right|^4\right],$$

and the corresponding update recursion is given by

$$\theta_n = \theta_{n-1} + \mu |e_n|^2 x_n e_n^*.$$
It has been shown [101] that minimization of the fourth power of the error may lead to an adaptive 
scheme with better compromise between convergence rate and excess MSE than the LMS if the noise 
source is sub-Gaussian. In a sub-Gaussian distribution, the tails of the PDF graph are decaying at 
a faster rate compared to the Gaussian one. The LMF could be seen as a version of the LMS with 
time-varying step size, equal to $\mu |e_n|^2$. Hence, when the error is large, the step size increases, which
helps the LMF to converge faster. On the other hand, when the error is small, the equivalent step size 
is reduced, leading to lower values of the excess MSE. This idea turns out to work well when the noise 
is sub-Gaussian. However, the LMF tends to become unstable in the presence of outliers. This is also 
understood, because for very large error values the equivalent step size becomes very large, leading 
to instability. This is, for example, the case when the noise follows a distribution with high tails; that 
is, when the corresponding PDF curve decays at slower rates compared to the Gaussian, such as in 
the case of super-Gaussian distributions. Results concerning the analysis of the LMF can be found in 
[69,70,83]. Robustness and stability of the algorithm are guaranteed for sufficiently small values of $\mu$
[83].
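To make the two recursions of this section concrete, the following Python/NumPy sketch implements a single update step of the sign-error LMS and of the LMF. The function names and the small numerical example are illustrative assumptions; the error is taken as $e_n = y_n - \theta_{n-1}^H x_n$, as in the complex linear LMS.

```python
import numpy as np

def csgn(z):
    """Complex sign: sgn(Re z) + j sgn(Im z)."""
    return np.sign(np.real(z)) + 1j * np.sign(np.imag(z))

def sign_error_lms_step(theta, x, y, mu):
    """One sign-error LMS update: theta + mu * csgn(e^*) * x."""
    e = y - np.vdot(theta, x)            # e = y - theta^H x
    return theta + mu * csgn(np.conj(e)) * x, e

def lmf_step(theta, x, y, mu):
    """One LMF update: an LMS step whose effective step size is mu * |e|^2."""
    e = y - np.vdot(theta, x)
    return theta + mu * abs(e) ** 2 * x * np.conj(e), e

# Hypothetical single sample, starting from theta = 0.
theta0 = np.zeros(2, dtype=complex)
x = np.array([1.0 + 0.0j, 0.0 + 1.0j])
y = 2.0 - 1.0j
theta_se, e_se = sign_error_lms_step(theta0, x, y, mu=0.5)
theta_lmf, e_lmf = lmf_step(theta0, x, y, mu=0.1)
```

In the sign-error step only the signs of the real and imaginary parts of the error enter the update, which is what makes a power-of-two $\mu$ lead to a multiplication-free recursion; in the LMF step the factor `abs(e) ** 2` is the time-varying step size discussed above.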


TRANSFORM-DOMAIN LMS 

We have already commented that the convergence speed of the LMS heavily depends on the condition
number of the covariance matrix; this will soon be demonstrated in the examples below.

Transform-domain techniques exploit the decorrelation properties of certain transforms, such as 
the DFT and DCT, in order to decorrelate the input variables. When the input comprises a stochastic 
process, we say that such transforms “prewhiten” the input process. Moreover, in the case where the 
involved variables are part of a random process, one can exploit the time-shifting property and employ
block processing techniques; by processing a block of data samples per time instant, one can exploit
efficient implementations of certain transforms, such as the fast Fourier transform (FFT), to reduce
the overall complexity. Such schemes are appropriate for applications where long filters are involved.
For example, in some applications, such as echo cancelation, filter orders of a few hundred taps are
commonly encountered [10,39,53,66,68].

Let $T$ be a unitary transform in the complex domain, represented by $TT^H = T^H T = I$. Define

$$\hat{x}_n = T^H x_n, \tag{5.66}$$

and apply the transform matrix $T^H$ on both sides of the respective recursion in Algorithm 5.4, for the
widely linear LMS, to obtain

$$\hat{\theta}_n = \hat{\theta}_{n-1} + \mu \hat{x}_n e_n^*, \tag{5.67}$$

where

$$\hat{\theta}_n = T^H \theta_n. \tag{5.68}$$

Note that

$$\hat{\theta}_{n-1}^H \hat{x}_n = \theta_{n-1}^H T T^H x_n = \theta_{n-1}^H x_n.$$

Hence, the error term is not affected. Note that until now, we have not achieved much. Indeed,

$$\Sigma_{\hat{x}} = \mathbb{E}[\hat{x}_n \hat{x}_n^H] = T^H \Sigma_x T$$

is a similarity transformation that does not affect the condition number of the matrix $\Sigma_x$; it is known
from linear algebra that both matrices share the same eigenvalues (Problem 5.14). Let us now choose
$T = Q$, where $Q$ is the unitary matrix comprising the orthonormal eigenvectors of $\Sigma_x$. For this case,
$\Sigma_{\hat{x}} = \Lambda$, which is the diagonal matrix with entries the eigenvalues of $\Sigma_x$. In the sequel, we modify the
transform-domain LMS in (5.67), so that we use a different step size per component, according to the
following scenario:

$$\hat{\theta}_n = \hat{\theta}_{n-1} + \mu \Lambda^{-1} \hat{x}_n e_n^*, \tag{5.69}$$

or

$$\tilde{\theta}_n = \tilde{\theta}_{n-1} + \mu \tilde{x}_n e_n^*, \tag{5.70}$$

where

$$\tilde{\theta}_n := \Lambda^{1/2} \hat{\theta}_n$$

and

$$\tilde{x}_n := \Lambda^{-1/2} \hat{x}_n.$$

Note that the error is not affected, because $\tilde{\theta}_n^H \tilde{x}_n = \hat{\theta}_n^H \hat{x}_n$. We have now achieved our original goal,
since

$$\Sigma_{\tilde{x}} = \Lambda^{-1/2} \Sigma_{\hat{x}} \Lambda^{-1/2} = \Lambda^{-1/2} \Lambda \Lambda^{-1/2} = I.$$

That is, the condition number of $\Sigma_{\tilde{x}}$ is equal to 1. In practice, this technique is difficult to apply, due
to the complexity associated with the eigendecomposition task; more importantly, $\Sigma_x$ must be known,
which in adaptive implementations is not the case. In practice, we resort to unitary transforms, $T$, that
approximately whiten the input, such as the DFT and DCT. Then (5.69) is replaced by

$$\hat{\theta}_n = \hat{\theta}_{n-1} + \mu D^{-1} \hat{x}_n e_n^*,$$

where $D$ is the diagonal matrix with entries the variances of the respective components of $\hat{x}_n$, or

$$[D]_{ii} = \mathbb{E}\left[|\hat{x}_n(i)|^2\right] = \sigma_i^2, \quad i = 1, 2, \ldots, l,$$

where $\hat{x}_n(i)$ is the $i$th entry of $\hat{x}_n$, which has been assumed to be a zero mean vector; this is justified
from the fact that if $\Sigma_{\hat{x}}$ is truly diagonal and equal to $\Lambda$, the eigenvalues correspond to the variances of
the respective elements. The entries $\sigma_i^2$ are estimated in a time-adaptive fashion, and the scheme given
in Algorithm 5.6 results.

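Algorithm 5.6 (The transform-domain LMS).

• Initialization

- $\hat{\theta}_{-1} = 0$; or any other value.

- $\sigma_i^2(-1) = \delta$, $i = 1, 2, \ldots, l$; $\delta$ a small value.

- Choose $\mu$, and $0 \ll \beta < 1$.

• For $n = 0, 1, 2, \ldots$, Do

- $\hat{x}_n = T^H x_n$

- $e_n = y_n - \hat{\theta}_{n-1}^H \hat{x}_n$

- $\hat{\theta}_n = \hat{\theta}_{n-1} + \mu D^{-1} \hat{x}_n e_n^*$

- For $i = 1, 2, \ldots, l$, Do

* $\sigma_i^2(n) = \beta \sigma_i^2(n-1) + (1 - \beta)|\hat{x}_n(i)|^2$

- End For

- $D = \text{diag}\{\sigma_i^2(n)\}$

• End For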
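A possible realization of Algorithm 5.6 is sketched below in Python with NumPy, for real-valued data and with the orthonormal DCT-II playing the role of $T^H$. The helper `dct_matrix`, the function name, the AR(1) test input, and in particular the choice `delta=1.0` in the usage part are assumptions of this sketch: with a very small $\delta$, the first few updates would use a very large effective step size $\mu/\sigma_i^2$.

```python
import numpy as np

def dct_matrix(l):
    """Orthonormal DCT-II matrix C (real, C @ C.T = I); used here as T^H."""
    idx = np.arange(l)
    C = np.sqrt(2.0 / l) * np.cos(np.pi * (idx[None, :] + 0.5) * idx[:, None] / l)
    C[0, :] /= np.sqrt(2.0)
    return C

def transform_domain_lms(X, y, mu, beta=0.99, delta=1.0):
    """Transform-domain LMS (Algorithm 5.6) for real data; returns theta and errors."""
    N, l = X.shape
    TH = dct_matrix(l)
    theta_hat = np.zeros(l)
    sigma2 = np.full(l, delta)                  # running variance estimates (diagonal of D)
    errors = np.empty(N)
    for n in range(N):
        x_hat = TH @ X[n]                       # transformed input
        e = y[n] - theta_hat @ x_hat
        theta_hat = theta_hat + mu * x_hat * e / sigma2   # per-component step mu / sigma_i^2
        sigma2 = beta * sigma2 + (1.0 - beta) * x_hat ** 2
        errors[n] = e
    return TH.T @ theta_hat, errors             # back to the original domain: theta = T theta_hat

# Hypothetical test: AR(1) input with coefficient 0.85, 10 unknown parameters.
rng = np.random.default_rng(3)
l, N = 10, 5000
v = rng.standard_normal(N + l - 1)
u = np.empty_like(v)
u[0] = v[0]
for n in range(1, len(v)):
    u[n] = 0.85 * u[n - 1] + v[n]
X = np.array([u[n:n + l][::-1] for n in range(N)])   # rows [u_n, ..., u_{n-l+1}]
theta_o = rng.standard_normal(l)
y = X @ theta_o + 0.1 * rng.standard_normal(N)
theta_est, errors = transform_domain_lms(X, y, mu=0.01, beta=0.99, delta=1.0)
```

As in the algorithm, the variance estimates used in the parameter update are those of the previous iteration; the division by `sigma2` is the product with $D^{-1}$.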


Subband adaptive filters are a related family of algorithms where whitening is achieved via block
data processing and the use of a multirate filter bank. Such schemes were first proposed in [41,53,54]
and have successfully been used in applications such as echo cancelation [45,53].

Another approach that has been adopted to improve the convergence rate, for the case of system
identification applications, is to select the input excitation signal to be of a specific type. For example,
in [5] it is pointed out that the excitation signal that optimizes the convergence speed of the NLMS
algorithm is a deterministic perfect periodic sequence (PPSEQ) with period equal to the length of the
impulse response of the system. Such sequences have been used in [24,25] to derive versions of the LMS
which not only converge fast but are also of very low computational complexity, requiring only one
multiplication, one addition, and one subtraction per update.


5.9 SIMULATION EXAMPLES 

Example 5.3. The goal of this example is to demonstrate the sensitivity of the convergence rate of
the LMS to the eigenvalue spread of the input covariance matrix. To this end, two experiments were
conducted, in the context of a regression/system identification task. Data were generated according to
our familiar model

$$y_n = \theta_o^T x_n + \eta_n.$$

The (unknown) parameters $\theta_o \in \mathbb{R}^{10}$ were randomly chosen from $\mathcal{N}(0, 1)$ and then frozen. In the first
experiment, the input vectors were formed by a white noise sequence with samples i.i.d. drawn from
$\mathcal{N}(0, 1)$. Thus, the input covariance matrix was diagonal with all the elements being equal to the
corresponding noise variance (Section 2.4.3). The noise samples $\eta_n$ were also i.i.d. drawn from a
Gaussian with zero mean and variance $\sigma_\eta^2 = 0.01$. In the second experiment, the input vectors were
formed by an AR(1) process with coefficient equal to $a_1 = 0.85$ and the corresponding white noise
excitation was of variance equal to 1 (Section 2.4.4). Thus, the input covariance matrix is no longer
diagonal and the eigenvalues are not equal. The LMS was run on both cases with the same step size
$\mu = 0.01$. Fig. 5.16 summarizes the results. The vertical axis (denoted as MSE) shows the squared error,
$e_n^2$, in dBs ($10 \log_{10} e_n^2$) and the horizontal axis shows the time instants (iterations) $n$. Note that both
curves level out at the same error floor. However, the convergence rate for the case of the white noise
input is significantly higher. The curves shown in the figure are the result of averaging 100 independent
experimental realizations.

FIGURE 5.16

Observe that for the same step size, the convergence of the LMS is faster when the input is white. The two curves
are the result of averaging 100 independent experimental realizations.
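An experiment of this kind can be reproduced along the following lines in Python with NumPy. This is only a sketch under stated assumptions: the input vectors are built by time-shifting the scalar sequence, the AR(1) process is started without a burn-in period, only 20 realizations are averaged to keep the run short, and all function names are hypothetical.

```python
import numpy as np

def make_inputs(kind, n_samples, l, rng, a1=0.85):
    """Return an (n_samples, l) array whose rows are x_n = [u_n, ..., u_{n-l+1}]."""
    v = rng.standard_normal(n_samples + l - 1)    # unit-variance white excitation
    if kind == "white":
        u = v
    else:
        # AR(1) process u_n = a1 * u_{n-1} + v_n (started from u_0 = v_0, no burn-in).
        u = np.empty_like(v)
        u[0] = v[0]
        for n in range(1, len(v)):
            u[n] = a1 * u[n - 1] + v[n]
    return np.array([u[n:n + l][::-1] for n in range(n_samples)])

def lms_squared_errors(X, y, mu):
    """Run the real-valued LMS from theta = 0 and return the squared errors e_n^2."""
    theta = np.zeros(X.shape[1])
    sq_err = np.empty(len(y))
    for n in range(len(y)):
        e = y[n] - theta @ X[n]
        theta = theta + mu * X[n] * e
        sq_err[n] = e ** 2
    return sq_err

def averaged_curve(kind, theta_o, mu=0.01, n_samples=1500, n_runs=20,
                   noise_var=0.01, seed=0):
    """Average the squared-error curve over n_runs independent realizations."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n_samples)
    for _ in range(n_runs):
        X = make_inputs(kind, n_samples, len(theta_o), rng)
        y = X @ theta_o + np.sqrt(noise_var) * rng.standard_normal(n_samples)
        acc += lms_squared_errors(X, y, mu)
    return acc / n_runs

theta_o = np.random.default_rng(42).standard_normal(10)
mse_white_db = 10 * np.log10(averaged_curve("white", theta_o))
mse_ar_db = 10 * np.log10(averaged_curve("ar1", theta_o))
```

Plotting `mse_white_db` and `mse_ar_db` against the iteration index gives curves of the type shown in Fig. 5.16; how pronounced the difference is depends on how much of $\theta_o$ lies along the eigenvectors of the small eigenvalues of the AR(1) covariance matrix.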

It must be emphasized that when comparing convergence performance of different algorithms, either
all algorithms should converge to the same error floor and the respective convergence rates be compared,
or all algorithms should have the same convergence rate and the respective error floors be compared.

Example 5.4. In this example, the dependence of the LMS on the choice of the step size is
demonstrated. The unknown parameters $\theta_o \in \mathbb{R}^{10}$ and the data were exactly the same as in the white noise
case of Example 5.3.

The LMS was run using the generated samples, with two different step sizes, namely, $\mu = 0.01$ and
$\mu = 0.075$. The obtained averaged (over 100 realizations) curves are shown in Fig. 5.17. Observe that
the larger the step size, for the same set of observation samples, the faster the convergence becomes
albeit at the expense of a higher error floor (misadjustment), in accordance with what was discussed in
Section 5.5.1.



FIGURE 5.17 


For the same input, the larger the step size for the LMS, the faster the convergence becomes, albeit at the expense of
higher error floor (MSE in dBs).

FIGURE 5.18


Observe that for the same step size, the convergence of the transform-domain LMS is significantly faster compared 
to the LMS, for similar error floors. The higher the eigenvalues spread of the input covariance matrix is, the more 
the obtained performance improvement becomes (MSE in dBs). 


Example 5.5 (LMS versus transform-domain LMS). In this example, the stage of the experimental 
setup is exactly the same as that considered in Example 5.3 for the AR(1) case. The goal is to compare 
the LMS and the transform-domain LMS. Fig. 5.18 shows the obtained averaged error curves. The step
size was the same as the one used in Example 5.3, $\mu = 0.01$; hence the curve for the LMS is the same
as the corresponding one appearing in Fig. 5.16. Observe the significantly faster convergence achieved 
by the transform-domain LMS, due to its (approximate) whitening effect on the input. 

Example 5.6. The experimental setup is similar to that of Example 5.4, with the only exception that
the unknown parameter vector was of higher dimension, $\theta_o \in \mathbb{R}^{60}$, so that the differences in the
performance of the algorithms are clearer. The goal is to compare the LMS, the NLMS, and the APA.
The step size of the LMS was chosen equal to $\mu = 0.025$ and for the NLMS $\mu = 0.35$ and $\delta = 0.001$,
so that both algorithms had similar convergence rates. The step size for the APA was chosen equal to
$\mu = 0.1$, so that, for $q = 10$, it settles at the same error floor as that of the NLMS. For the APA, we also
chose $\delta = 0.001$. The results are shown in Fig. 5.19.

Observe the lower error floor, for the same convergence rate, obtained by the NLMS compared
to the LMS and the improved performance obtained by the APA for $q = 10$. Increasing the number
of past data samples (re)used in APA to $q = 30$, we can see the improved convergence rate that is
obtained, although at the expense of higher error floor, as predicted by the theoretical results reported
in Section 5.6.

FIGURE 5.19

For the same step size, the NLMS converges at the same rate to a lower error floor compared to the LMS. For the 
APA, increasing q improves the convergence, at the expense of higher error floors (MSE in dBs). 


5.10 ADAPTIVE DECISION FEEDBACK EQUALIZATION 

The task of channel equalization was introduced in Fig. 4.12. The input to the equalizer is a stochastic
process (random signal), which, according to the notational convention introduced in Section 2.4,
will be denoted as $u_n$. Note that upon receiving the sample $u_n$, which is noisy and distorted by the
(communications) channel, one has to obtain an estimate of the originally transmitted information sequence, $s_n$,
delayed by $L$ time lags, which accounts for the various delays imposed by the overall transmission
system involved. Thus, at time $n$, the equalizer decides for $s_{n-L+1}$. Ideally, if one knew the true values
of the initially transmitted information sequence up to and including time instant $n - L$, represented by
$s_{n-L}, s_{n-L-1}, s_{n-L-2}, \ldots$, it could only be beneficial to use this information, together with the received
sequence, $u_n$, to recover an estimate of $s_{n-L+1}$. This idea is explored in the decision feedback equalizer
(DFE). The equalizer’s output, for the complex-valued data case, is now written as

$$\hat{d}_n = \sum_{i=0}^{L-1} w_i^{f*} u_{n-i} + \sum_{i=0}^{l-1} w_i^{b*} s_{n-L-i} = w^H u_{e,n}, \tag{5.71}$$

where

$$w := \begin{bmatrix} w^f \\ w^b \end{bmatrix} \in \mathbb{C}^{L+l}, \qquad u_{e,n} := \begin{bmatrix} \mathbf{u}_n \\ \mathbf{s}_n \end{bmatrix} \in \mathbb{C}^{L+l},$$

and $\mathbf{s}_n := [s_{n-L}, \ldots, s_{n-L-l+1}]^T$. The desired response at time $n$ is

$$d_n = s_{n-L+1}.$$

FIGURE 5.20

The forward part of the DFE acts on the received samples, while the backward part acts on the training
data/decisions, depending on the mode of operation.

In practice, after the initial training period, the information samples $s_{n-L-i}$ are replaced by their
computed estimates, $\hat{s}_{n-L-i}$, $i = 0, 1, \ldots, l-1$, which are available from decisions taken in previous time
instants. It is said that the equalizer operates in the decision-directed mode. The basic DFE structure
is shown in Fig. 5.20. Note that during the training period, the parameter vector, $w$, is trained so as to
minimize the power of the error,

$$e_n = d_n - \hat{d}_n = s_{n-L+1} - \hat{d}_n.$$

Once all the available training samples have been used, training carries on using the estimates $\hat{s}_{n-L+1}$.

For example, for a binary information sequence $s_n \in \{1, -1\}$, the decision concerning the estimate
at time $n$ is obtained by passing $\hat{d}_n$ through a threshold device and $\hat{s}_{n-L+1}$ is obtained. Note that
DFE is one of the early examples of semisupervised learning, where training data are not enough, and
the estimates are used for training [94]. In this way, assuming that at the end of the training phase
$\hat{s}_{n-L+1} = s_{n-L+1}$, and also that time variations are slow, so as to guarantee that $\hat{d}_n \approx d_n$, we expect,
with good enough probability, that $\hat{s}_{n-L+1}$ will remain equal to $s_{n-L+1}$, so that the equalizer can track
the changes. More on DFEs and their error performance can be found in [77].

Any of the adaptive schemes treated so far can be used in a DFE scenario by replacing in the
input vector, $u_{e,n}$, the term $\mathbf{s}_n$ by $\hat{\mathbf{s}}_n$, when operating in the decision-directed mode. Note that adaptive
algorithms in the context of the equalization task were employed first in [44,78]. A version of the DFE,
operating in the frequency domain, has been proposed for the first time in [14].

FIGURE 5.21

The MSE in dBs for the DFE of Example 5.7. After the time instant $n = 250$, the LMS is trained with the decisions
$\hat{s}_{n-L+1}$.

Thus the LMS recursion for the linear DFE, in its complex-valued formulation, becomes

$$\hat{d}_n = w_{n-1}^H u_{e,n},$$

$$d_n = s_{n-L+1}; \quad \text{in the training mode, or}$$

$$d_n = T[\hat{d}_n]; \quad \text{in the decision-directed mode,}$$

$$e_n = d_n - \hat{d}_n,$$

$$w_n = w_{n-1} + \mu u_{e,n} e_n^*,$$

where $T[\cdot]$ denotes the thresholding operation.
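The recursion above can be sketched for real-valued binary symbols as follows, in Python with NumPy. The function name, the way the SNR is converted into a noise variance, the skipping of the first $L + l - 1$ samples so that all entries of $u_{e,n}$ are available, and the default parameter values are assumptions of this sketch; the channel and the parameter values are borrowed from Example 5.7 that follows.

```python
import numpy as np

def simulate_dfe_lms(n_train=250, n_dd=10000, L=21, l=10, mu=0.025,
                     snr_db=11.0, seed=0):
    """LMS-trained DFE: training mode first, then decision-directed mode."""
    rng = np.random.default_rng(seed)
    h = np.array([0.04, -0.05, 0.07, -0.21, 0.72, 0.36, 0.21, 0.03, 0.07])
    start = L + l - 1                       # first n for which u_{e,n} is fully available
    n_total = start + n_train + n_dd
    s = rng.choice([-1.0, 1.0], size=n_total)
    clean = np.convolve(s, h)[:n_total]     # channel output before noise
    noise_var = np.mean(clean ** 2) / 10 ** (snr_db / 10)   # assumed SNR definition
    u = clean + np.sqrt(noise_var) * rng.standard_normal(n_total)

    w = np.zeros(L + l)
    fed_back = s.copy()                     # true symbols; overwritten by decisions later
    sq_err = np.empty(n_total - start)
    n_errors = 0
    for n in range(start, n_total):
        training = n < start + n_train
        u_vec = u[n - L + 1:n + 1][::-1]                    # [u_n, ..., u_{n-L+1}]
        s_vec = fed_back[n - L - l + 1:n - L + 1][::-1]     # [s_{n-L}, ..., s_{n-L-l+1}]
        u_e = np.concatenate([u_vec, s_vec])
        d_hat = w @ u_e
        decision = 1.0 if d_hat >= 0 else -1.0              # threshold device T[.]
        d = s[n - L + 1] if training else decision
        if not training:
            fed_back[n - L + 1] = decision                  # feed decisions back
            n_errors += int(decision != s[n - L + 1])
        w = w + mu * u_e * (d - d_hat)                      # LMS update with e_n = d_n - d_hat_n
        sq_err[n - start] = (s[n - L + 1] - d_hat) ** 2     # error against the true symbol
    return w, n_errors / n_dd, sq_err

w_dfe, error_rate, sq_err = simulate_dfe_lms()
```

The returned `sq_err` is measured against the true transmitted symbols, so that it can be averaged over realizations and plotted in dBs; `error_rate` counts decision errors in the decision-directed phase only.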

Example 5.7. Let us consider a communication system where the input information sequence
comprises a stream of randomly generated symbols $s_n = \pm 1$, with equal probability. This sequence is sent
to a channel with impulse response,

$$h = [0.04, -0.05, 0.07, -0.21, 0.72, 0.36, 0.21, 0.03, 0.07]^T.$$

The output of the channel is contaminated by white Gaussian noise at the SNR = 11 dB level. A DFE
is used with the length of the feed-forward section being equal to $L = 21$ and the length of the
feedback part equal to $l = 10$. The DFE was trained with 250 symbols; then, it was switched on to the
decision-directed mode and it was run for 10,000 iterations. At each iteration, the decision ($\text{sgn}(\hat{d}_n)$)
was compared with the true transmitted symbol $s_{n-L+1}$. The error rate (total number of errors over the
corresponding number of transmitted symbols) was approximately 1%. Fig. 5.21 shows the averaged
MSE curve as a function of the number of iterations. For the LMS, we used $\mu = 0.025$.

5.11 THE LINEARLY CONSTRAINED LMS

The task of linearly constrained MSE estimation was introduced in Section 4.9.2. Here, we turn our
attention to its online stochastic gradient counterpart. The discussion will be carried out within the
notational convention used for linear filtering involving processes, since a typical application is that of
beamforming; this is also the reason that we will involve the complex-valued formulation. However,
everything to be said carries over to the more general linear estimation task.

In Section 4.9.2, the goal was to minimize either the noise variance or the output of the filter
(beamformer), subject to a set of constraints, leading to the costs $w^H \Sigma_\eta w$ or $w^H \Sigma_u w$, respectively. In
a more general setting, we will require the output to be “close” to a desired response random signal $d_n$.
For the beamforming setting, this corresponds to a desired (training) signal; that is, besides the desired
direction, which is provided via the constraints, we are also given a desired signal sequence. Moreover,
we will assume that our solution has to satisfy more than one constraint. The task now becomes


$$w_* = \arg\min_{w} \mathbb{E}\left[\left|d_n - w^H u_n\right|^2\right],$$

$$\text{s.t.} \quad w^H c_i = g_i, \quad i = 1, 2, \ldots, m,$$

for some $g_i \in \mathbb{R}$. The corresponding Lagrangian is

$$L(w) = \sigma_d^2 + w^H \mathbb{E}[u_n u_n^H] w - w^H \mathbb{E}[u_n d_n^*] - \mathbb{E}[u_n^H d_n] w + \sum_{i=1}^{m} \lambda_i \left(w^H c_i - g_i\right), \tag{5.72}$$

where $\lambda_i$, $i = 1, 2, \ldots, m$, are the corresponding Lagrange multipliers. Taking the gradient with respect
to $w^*$ and considering $w$ to be a constant, we get

$$\nabla_{w^*} L(w) = \mathbb{E}[u_n u_n^H] w - \mathbb{E}[u_n d_n^*] + \sum_{i=1}^{m} \lambda_i c_i.$$

Applying the Robbins-Monro scheme to find the root, we get

$$w_n = w_{n-1} + \mu u_n e_n^* - \mu C \lambda_n, \tag{5.73}$$

where we have used a constant step size $\mu$,

$$\hat{d}_n = w_{n-1}^H u_n,$$

and

$$e_n = d_n - \hat{d}_n,$$

and we have allowed $\lambda$ to be time dependent; $C$ is defined as the matrix having as columns the vectors
$c_i$, $i = 1, 2, \ldots, m$.

Plugging (5.73) into the constraints in (5.72), which can be compactly written as $C^H w = g$ (recall
$g_i \in \mathbb{R}$), we readily obtain

$$-\mu \lambda_n = (C^H C)^{-1} g - (C^H C)^{-1} C^H \left(w_{n-1} + \mu u_n e_n^*\right),$$

and the update recursion becomes

$$w_n = \left(I - C (C^H C)^{-1} C^H\right)\left(w_{n-1} + \mu u_n e_n^*\right) + C (C^H C)^{-1} g.$$

Note that $(I - C(C^H C)^{-1} C^H)$ is the orthogonal projection matrix on the intersection (affine set) of the
hyperplanes defined by the constraints (recall that $C(C^H C)^{-1} C^H$ is the respective projection matrix
on the subspace spanned by $c_i$, $i = 1, 2, \ldots, m$). In the case the goal is to minimize the output, one has
to set in $e_n$ the desired response $d_n = 0$.

The constrained LMS was first treated in [40]. Besides the previously used constraints, other
constraints can also be used; for example, for the constrained NLMS, one additionally demands

$$w_n^H u_n = d_n$$

(see, e.g., [6]).
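The constrained LMS update derived above can be sketched in Python with NumPy as shown next. The function name `constrained_lms`, the feasible initialization $w_{-1} = C(C^H C)^{-1} g$, and the synthetic data of the usage part are assumptions made for illustration only.

```python
import numpy as np

def constrained_lms(U, d, C, g, mu):
    """Linearly constrained LMS.

    U: (N, l) complex array with rows u_n; d: (N,) desired responses;
    C: (l, m) constraint matrix; g: (m,) real vector, so that C^H w = g.
    """
    N, l = U.shape
    CHC_inv = np.linalg.inv(C.conj().T @ C)
    P = np.eye(l) - C @ CHC_inv @ C.conj().T    # projection onto the complement of span{c_i}
    f = C @ CHC_inv @ g                         # translation enforcing C^H w = g
    w = f.astype(complex)                       # feasible starting point (an assumption)
    for n in range(N):
        u = U[n]
        e = d[n] - np.vdot(w, u)                # e_n = d_n - w_{n-1}^H u_n
        w = P @ (w + mu * u * np.conj(e)) + f
    return w

# Hypothetical setup: l = 6 taps, m = 2 constraints, circular white input.
rng = np.random.default_rng(5)
l, m, N = 6, 2, 5000
C = rng.standard_normal((l, m)) + 1j * rng.standard_normal((l, m))
g = np.array([1.0, 0.0])
r = rng.standard_normal(l) + 1j * rng.standard_normal(l)
G = np.linalg.inv(C.conj().T @ C)
w_true = r - C @ G @ (C.conj().T @ r - g)       # a vector that satisfies the constraints
U = (rng.standard_normal((N, l)) + 1j * rng.standard_normal((N, l))) / np.sqrt(2)
d = U @ np.conj(w_true) + 0.01 * rng.standard_normal(N)
w = constrained_lms(U, d, C, g, mu=0.02)
```

Because $C^H P = 0$ and $C^H f = g$, every iterate satisfies the constraints exactly, irrespective of the data or of the step size.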


5.12 TRACKING PERFORMANCE OF THE LMS IN NONSTATIONARY
ENVIRONMENTS

We have already considered the convergence behavior of the LMS and made related comments for the
other algorithms that have been discussed. As stated before, convergence is a transient phenomenon;
that is, it concerns the period from the initial kick-off point to the steady state. The steady state was also
discussed for stationary environments, that is, environments in which the unknown parameter vector as
well as the underlying statistics of the involved random variables/processes remain unchanged.

We now turn our focus to cases where the true (yet unknown) parameter vector/system undergoes
changes. Thus, this affects the output observations and consequently their statistics. Note that the
statistics of the input can also change. However, we are not going to consider such cases, as the analysis can
become quite involved. Our goal is to study the tracking performance of the LMS, that is, the ability
of the algorithm to track changes of the unknown parameter vector. Note that tracking is a steady-state
phenomenon. In other words, we assume that enough time has elapsed so that the influence of the
initial conditions has been forgotten and has no effect, any more, on the algorithm. Tracking agility and
convergence speed are two different properties of an algorithm. An algorithm may converge fast, but it
may not necessarily have a good tracking performance, and vice versa. We will see such cases later on.

The setting of our discussion will be similar to that of Section 5.5.1. In conformity with the
discussion there, we consider the real-data case; similar results hold true for the complex-valued linear
estimation scenario. However, in contrast to the adopted model in (5.38), a time-varying model is
adopted here, using the following assumptions.

Assumption 1. The output observations are generated according to the model

$$y_n = \theta_{o,n-1}^T x + \eta, \tag{5.74}$$

which is in line with the prediction model used in LMS, at time $n$. That is, the unknown set of
parameters is a time-varying one. The statistical properties of the input vector $x$, as well as the noise variable $\eta$,
are assumed to be time independent, and this is the reason we have not used the time index;
equivalently, in the case where the input is a random process, $u_n$, it is assumed to be stationary. Moreover, the
input variables are assumed to be independent of the zero mean noise variable, $\eta$. Furthermore,
successive samples of $\eta$ are i.i.d. (white noise sequence) of variance $\sigma_\eta^2$. So far, we have not gone much
beyond Assumption 1, stated in Section 5.5.1.

Assumption 2. The time-varying model follows a random walk variation, represented by

$$\theta_{o,n} = \theta_{o,n-1} + \omega. \tag{5.75}$$

The random vector $\omega$ is assumed to be zero mean with covariance matrix

$$\mathbb{E}[\omega \omega^T] = \Sigma_\omega.$$

Note that the variance of a random walk grows unbounded with time; this is readily shown by applying
(5.75) recursively.

A variant of this model would be more sensible to use,

$$\theta_{o,n} = a\, \theta_{o,n-1} + \omega,$$

with $|a| < 1$ [106]. However, the analysis gets more involved, so we will stick with the model in (5.75).
After all, our goal here is to highlight and have a first touch on the notion of tracking and to get an idea
of its effects on the misadjustment in steady state.

Assumption 3. As in Section 5.5.1, we assume that $c_{n-1} := \theta_{n-1} - \theta_{o,n-1}$ is independent of $x$ and $\eta$.
This time, we will also assume independence of $c_n$ and $\omega$.

Recall from (5.48) that the excess MSE at time n is given by 


7exc ,n — tracej E x E cn -\ |. 


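Thus, our goal now is to compute $\Sigma_{c,n-1}$ for the time-varying model case. It is straightforward to see
that the counterpart of (5.35) now becomes

$$\boldsymbol{c}_n = \left(I - \mu\boldsymbol{x}\boldsymbol{x}^T\right)\boldsymbol{c}_{n-1} + \mu\boldsymbol{x}\eta - \boldsymbol{\omega}. \tag{5.76}$$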


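Adopting the previously stated three assumptions, as well as the Gaussian assumption for $\boldsymbol{x}$, and following
exactly the same steps used for (5.41), we end up with

$$\begin{aligned}
\Sigma_{c,n} = {} & \Sigma_{c,n-1} - \mu\Sigma_x\Sigma_{c,n-1} - \mu\Sigma_{c,n-1}\Sigma_x + 2\mu^2\Sigma_x\Sigma_{c,n-1}\Sigma_x \\
& + \mu^2\Sigma_x\,\text{trace}\left\{\Sigma_x\Sigma_{c,n-1}\right\} + \mu^2\sigma_\eta^2\Sigma_x + \Sigma_\omega.
\end{aligned} \tag{5.77}$$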


Note that if complex data are involved, the only difference is that the fourth term on the right-hand side
is not multiplied by 2 (Problem 5.15). This equation governs the propagation of $\Sigma_{c,n}$, which in turn
can provide the excess MSE error.

A more convenient form results for small values of $\mu$, where the fourth and fifth terms on the
right-hand side can be neglected, being small with respect to $\mu\Sigma_x\Sigma_{c,n-1}$. Moreover, at the steady
state, $\Sigma_{c,n} = \Sigma_{c,n-1} := \Sigma_c$, and taking the trace on both sides, we end up with (recall
$\text{trace}\{A + B\} = \text{trace}\{A\} + \text{trace}\{B\}$ and $\text{trace}\{AB\} = \text{trace}\{BA\}$)

$$J_{\text{exc}} = \text{trace}\left\{\Sigma_x\Sigma_c\right\} \approx \frac{1}{2}\left(\mu\sigma_\eta^2\,\text{trace}\{\Sigma_x\} + \frac{1}{\mu}\,\text{trace}\{\Sigma_\omega\}\right). \tag{5.78}$$


Note that this is exactly the same approximation as the one resulting from the more sound theory of 
energy conservation [83]. 

Compare (5.78) with (5.53); in the current setting, there is another term associated with the noise,
which drifts the model around its mean. Thus, (a) the excess MSE is contributed by the inability of the
LMS to obtain the optimum value exactly, and (b) there is an extra term measuring its “inertia” to track
the changes of the model fast enough. This is the most important outcome of the current discussion. In
time-varying environments, the misadjustment increases. Moreover, looking at (5.78) and at the effect
of $\mu$, it is observed that small values of $\mu$ have a beneficial effect on the first term, but they increase
the contribution of the second one. The opposite is true for relatively big values of $\mu$. This is natural.
Small step sizes give the algorithm the chance to learn better under stationary environments, but the
algorithm cannot track the changes fast enough. Thus, the choice of $\mu$ should be the outcome of a
tradeoff. Minimizing the excess error in (5.78) with respect to $\mu$ is easily shown to result in

$$\mu_{\text{opt}} = \sqrt{\frac{\text{trace}\{\Sigma_\omega\}}{\sigma_\eta^2\,\text{trace}\{\Sigma_x\}}}.$$

Note, however, that such choices for $\mu$ are of theoretical importance only. In practice, the time variation
of the system can hardly correspond to that of the adopted model; the latter, due to the complexity of
the analysis, was chosen to be a simple one in an effort to simplify the mathematical manipulations.
Moreover, for the sake of simplifying the analysis, a number of assumptions were adopted. In practice,
$\mu$ is chosen more according to the user’s practical experience, after experimentation, than based on the
theory. The theory, however, has pointed out the tradeoff between the speed of convergence and that of
tracking.
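
The tradeoff expressed by (5.78) can be explored numerically. The following is a minimal sketch, not part
of the original analysis: it evaluates the small-step-size approximation of the excess MSE over a grid of
step sizes and compares the grid minimizer with the closed form obtained by setting the derivative of
(5.78) with respect to $\mu$ to zero, that is, $\mu = \sqrt{\text{trace}\{\Sigma_\omega\}/(\sigma_\eta^2\,\text{trace}\{\Sigma_x\})}$. The function names and
all numerical values are hypothetical, chosen only for illustration.

```python
import numpy as np

def excess_mse(mu, sigma_eta2, trace_sigma_x, trace_sigma_w):
    """Small-step-size approximation (5.78) of the steady-state excess MSE."""
    return 0.5 * (mu * sigma_eta2 * trace_sigma_x + trace_sigma_w / mu)

def optimal_step(sigma_eta2, trace_sigma_x, trace_sigma_w):
    """Minimizer of (5.78) over mu > 0, obtained by zeroing its derivative."""
    return np.sqrt(trace_sigma_w / (sigma_eta2 * trace_sigma_x))

# Hypothetical values: noise variance, trace of Sigma_x, trace of Sigma_omega.
sigma_eta2, trace_sigma_x, trace_sigma_w = 0.01, 10.0, 1e-5

mu_opt = optimal_step(sigma_eta2, trace_sigma_x, trace_sigma_w)
mus = np.logspace(-4, -1, 400)                       # candidate step sizes
j_exc = excess_mse(mus, sigma_eta2, trace_sigma_x, trace_sigma_w)
mu_grid = mus[np.argmin(j_exc)]                      # brute-force minimizer

print(f"closed form: {mu_opt:.5f}, grid search: {mu_grid:.5f}")
```

At the minimizer the two terms of (5.78) are equal, which is one way to read the tradeoff: the step size
balances the gradient-noise contribution against the tracking lag.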

More mathematically rigorous analysis of the performance of online/adaptive schemes in nonstationary
environments can be obtained from [16,38,47,61,70,83]. Simulation results demonstrating the
tracking performance of the LMS compared with other algorithms are given in Example 6.3 in Chapter 6,
where the recursive least-squares (RLS) algorithm is presented.


5.13 DISTRIBUTED LEARNING: THE DISTRIBUTED LMS 


The focus of our attention is now turned toward a problem that has been of increasing importance
over the last decade or so. There is a growing number of applications where data are received/reside
in different sensors/databases, which are spatially distributed. However, all this spatially distributed
information has to be exploited towards achieving a common goal, i.e., to perform a common
estimation/inference task. We refer to such tasks as distributed or decentralized learning. At the heart of this
problem lies the concept of cooperation, which is another name for the process of exchanging learning
experience/information in order to reach a common goal/decision. Human societies have survived
because of cooperation (and have disappeared due to lack of cooperation).

Distributed learning is common in many biological systems, where no individual/agent is in charge,
yet the group exhibits a high degree of intelligence (we humans refer to it as instinct). Look, for
example, at the way birds fly in formation and bees swarm in a new hive.

Besides sociology and biology, science and engineering have used the concept of distributed learning;
wireless sensor networks (WSNs) are a typical example. WSNs were originally suggested as
spatially distributed autonomous sensors to monitor physical and environmental conditions, such as
pressure, temperature, and sound, and to cooperatively pass their data to a central unit. Although WSNs
were originally motivated for military applications, today they are targeted at a diverse number of
applications, such as traffic control, homeland security and surveillance, health care, and environmental
modeling. Each sensor node is equipped with an onboard processor, in order to perform locally some
simple processing and transmit the required and partially processed data. Sensors/nodes are
characterized by low processing, memory, and communication capabilities due to low energy and bandwidth
constraints [1,104].

Other typical examples of distributed learning applications are the modeling and study of the way
individuals are linked in social networks, modeling pathways defined over complex power grids,
cognitive radio systems, and pattern recognition; the common characteristic of all these applications is that
data are partially processed in each individual node/agent, and the processed information is passed over
to the network under a certain protocol.

The obvious question that the unfamiliar reader may ask is why not use only one node/agent and the
locally residing information to perform the inference task. The answer, of course, is that we can come
up with better estimates/results by exploiting the available data/information across the whole network.
This brings us to the notion of consensus.

According to the American Heritage dictionary, “consensus” is defined as “an opinion or position
reached by a group as a whole”; that is, consensus is the process that guarantees an “accepted
agreement” within the group. The term “accepted agreement” is not uniquely defined. In some cases, this
may refer to a unanimous decision; in other cases it refers to a majority rule. In some cases, all the
opinions of all agents are equally weighted whereas in others, different weights are imposed, based on
some relative significance measures. However, in all cases, the essence of any consensus-based process
is the trust that a “better” decision is reached when compared to the process of each agent/person acting
individually.

In this section, we focus on the task of parameter estimation. Each individual agent has access
to partial information via a “local” acquisition process of data. Although each agent has access to a 
different set of data, they all share a common goal, i.e., to estimate the same unknown set of parameters. 
This task will be achieved in a collaborative manner. However, different cooperation scenarios can be 
adopted. 

5.13.1 COOPERATION STRATEGIES 

In distributed learning, each individual agent is represented as a node in a graph. Edges between nodes 
indicate that the respective agents can exchange information. Undirected edges indicate that
information can be exchanged in both directions, while directed edges indicate the allowed direction of
information flow.


More rigorous definitions on graphs are given in Chapter 15.

Centralized Networks

Under this scenario of cooperation, nodes communicate their measurements to a central fusion unit
for processing. The obtained estimate can be communicated back to each one of the nodes. Fig. 5.22
illustrates the topology. In Fig. 5.22A all nodes are connected directly to the fusion center, indicated
by a square. In Fig. 5.22B, some of the nodes can be linked directly to the fusion center, while others
communicate their measurements to a linked neighbor, which then passes the received as well as the
locally available observations/measurements either to a neighboring node or to the fusion center. The
major advantage in this cooperation strategy is that the fusion center can compute optimal estimates,
since it has access to all the available information. However, the optimality comes with a number
of drawbacks, such as increased communication costs and delays, especially for large
networks. In addition, when the fusion center breaks down, the whole network collapses. Moreover,
in certain applications, privacy issues are involved. For example, when data concern medical records,
the nodes do not wish to send the available (training) data; it is preferable to communicate
certain locally obtained processed information instead. To overcome the drawbacks of the centralized
processing scenario, different distributed processing schemes have been proposed.

FIGURE 5.22

The square indicates the fusion center. (A) All nodes communicate directly to the fusion center. (B) Some nodes
are connected directly to the fusion center. Others communicate their own data to a neighboring node, and so on,
until the information reaches the fusion center. The bolder a connection is drawn, the higher the amount of data
transmitted via the corresponding link.

Decentralized Networks 

Under this scenario, there is no central fusion center. Processing is performed locally at each node,
employing the locally received measurements, and in the sequel, each node communicates the locally
obtained estimates to its neighbors, that is, to the nodes it is linked with. These links are denoted as
edges in the respective graph. There are different decentralized schemes.

FIGURE 5.23

(A) The incremental or ring topology. The information flow follows a cyclic path. (B) Topology corresponding to a 
diffusion strategy. Each node communicates information to the nodes with which it shares an edge. 


• Incremental/ring networks: These require the existence of a cyclic path following the edges through
the network. Starting from a node, such a cycle has to visit every node at least once, and then return 
to the first one. Such a topology implements an iterative computational scheme. At each iteration, 
every node performs its data acquisition and processing locally and communicates the required 
information to its neighbor in the cyclic path. It has been shown that incremental schemes achieve 
global performance (e.g., [58]). The main disadvantage of this mode of cooperation is that cycling 
information around at each iteration is a problem in large networks. It is also important to stress at 
this point that the construction and maintenance of a cyclic graph, visiting each node, is an NP-hard 
task [52]. Moreover, the whole network collapses if one node malfunctions. The corresponding 
graph topology is shown in Fig. 5.23A. 

• Ad hoc networks: According to this philosophy of cooperation, nodes perform locally data
acquisition as well as processing, at each iteration. However, the constraint of a cyclic path is removed.
Each node communicates information to its neighboring nodes with which it shares an edge; in this
way, information is diffused across the whole network. An advantage of such schemes is that
operation does not cease if some nodes are malfunctioning. Also, the topology of the network need not be
fixed. The price one pays for such “extras” is that the final obtained performance, after convergence,
is inferior to that obtained by its incremental counterpart and by the centralized processing. This
is natural, since at each iteration every node has access to only a limited amount of information.
Fig. 5.23B illustrates an example of the topology for ad hoc networks.

Besides the previous schemes, a number of variants exist. For example, the neighbors of each node
may change probabilistically, which introduces randomness in the way information is diffused at each 
iteration [33,60]. 


9 Consider a set of nodes $x_1, \ldots, x_k$ in a graph such that there is an edge connecting $(x_{i-1}, x_i)$, $i = 2, \ldots, k$.
The set of edges connecting the $k$ nodes is a path.

FIGURE 5.24

A graph corresponding to a network operating under a diffusion strategy. The red dotted line encircles the nodes 
comprising the neighborhood of node k = 6. 


In this section, our focus will be on diffusion schemes. Our aim is to provide the reader with a 
sample of basic techniques around the LMS scheme and not to cover distributed learning in general, 
which is a field on its own with a long history; see [9,19,29,51,76,95,107] for sample references from 
classical to more recent contributions to the field. Besides distributed inference, a number of related 
aspects concerning the topology of the network and learning over graphs are attracting a lot of attention 
in the context of the emerging field of complex networks; see [18,37,87,89,100] and the references
therein. 

5.13.2 THE DIFFUSION LMS 

Let us consider a network of $K$ agents/nodes. Each node exchanges information with the nodes in its
neighborhood. Given a node $k$ in a graph, let $\mathcal{N}_k$ be the set of nodes with which this node shares an
edge; moreover, node $k$ is also included in $\mathcal{N}_k$. This comprises the neighborhood set of $k$. We will
denote the cardinality of this set as $n_k$. Fig. 5.24 shows a graph with six nodes. For example, the
neighborhood of node $k = 6$ is $\mathcal{N}_6 = \{2, 3, 6\}$, with cardinality $n_6 = 3$. The cardinality of $\mathcal{N}_k$ is also
known as the degree of node $k$. In contrast, nodes $k = 6$ and $k = 5$ are not neighbors, because they
are not directly linked via an edge. For the needs of the section, we assume that the graph is a strongly
connected one; that is, there is at least one path of edges that connects any pair of nodes.
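
The neighborhood sets and the connectivity assumption can be made concrete with a few lines of code. The
following is a minimal sketch; the edge list is hypothetical (the full topology of Fig. 5.24 is not
reproduced here) and was only chosen to agree with the two facts stated above, namely $\mathcal{N}_6 = \{2, 3, 6\}$ and
that nodes 5 and 6 are not neighbors. The helper names `neighborhoods` and `is_connected` are illustrative.

```python
from collections import deque

# Hypothetical undirected edge list for a six-node network (nodes numbered 1..6).
edges = [(1, 2), (1, 4), (2, 3), (2, 6), (3, 6), (3, 5), (4, 5)]
K = 6

def neighborhoods(K, edges):
    """Return {k: N_k}, where N_k contains k itself plus all adjacent nodes."""
    nbrs = {k: {k} for k in range(1, K + 1)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs

def is_connected(nbrs):
    """Breadth-first search from one node; True if every node is reached."""
    start = next(iter(nbrs))
    seen = {start}
    queue = deque([start])
    while queue:
        k = queue.popleft()
        for m in nbrs[k] - seen:
            seen.add(m)
            queue.append(m)
    return len(seen) == len(nbrs)

N = neighborhoods(K, edges)
n = {k: len(N[k]) for k in N}      # cardinalities n_k
print(N[6], n[6], is_connected(N))
```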

Each node in the network has access to a local data acquisition process, which provides the pair
of training data$^{10}$ $(y_k(n), \boldsymbol{x}_k(n))$, $k = 1, 2, \ldots, K$, $n = 0, 1, \ldots$, which are i.i.d. observations drawn
from the respective zero mean jointly distributed stochastic variables, $y_k$, $\boldsymbol{x}_k$, $k = 1, \ldots, K$. We further
assume that, in all cases, the pairs of the input-output variables are associated with a common
(unknown) parameter vector $\boldsymbol{\theta}_o$. For example, in every node, the data are assumed to be generated by
a corresponding regression model

$$y_k = \boldsymbol{\theta}_o^T\boldsymbol{x}_k + \eta_k, \quad k = 1, 2, \ldots, K, \tag{5.79}$$

where $\boldsymbol{x}_k$ and the zero mean noise variable $\eta_k$ obey, in general, different statistical properties in each
node. We will discuss such applications soon.

10 The time index is used in parentheses, to unclutter notation due to the presence of the node index k.

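Treating each node individually, the MSE optimal solution, which minimizes the local cost function

$$J_k(\boldsymbol{\theta}) = \mathbb{E}\left[\left|y_k - \boldsymbol{\theta}^T\boldsymbol{x}_k\right|^2\right],$$

will be given by the respective normal equations, involving the respective covariance matrix and
cross-correlation vector; in other words,

$$\Sigma_{x_k}\boldsymbol{\theta}_* = \boldsymbol{p}_k. \tag{5.80}$$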

Recall that for the case of a regression model, $\boldsymbol{\theta}_* = \boldsymbol{\theta}_o$ and the same solution results from all nodes.
Undoubtedly, if the statistics $\Sigma_{x_k}$, $\boldsymbol{p}_k$, $k = 1, 2, \ldots, K$, were known, we could stop here. However, we
already know that this is not the case and in practice they have to be estimated. Alternatively, one has
to resort to iterative techniques to learn the statistics as well as the unknown parameters. This is where
one has to consider all nodes, to benefit from all the observations, which are distributed across the
network. Thus, a more natural criterion to adopt is

$$J(\boldsymbol{\theta}) = \sum_{k=1}^{K} J_k(\boldsymbol{\theta}) = \sum_{k=1}^{K}\mathbb{E}\left[\left|y_k - \boldsymbol{\theta}^T\boldsymbol{x}_k\right|^2\right]. \tag{5.81}$$


Using the standard arguments, which we have employed a number of times so far, it is readily seen that
the (common) estimate of the unknown $\boldsymbol{\theta}_o$ will be provided as a solution of

$$\left(\sum_{k=1}^{K}\Sigma_{x_k}\right)\boldsymbol{\theta}_* = \sum_{k=1}^{K}\boldsymbol{p}_k.$$


Let us use the global cost in (5.81) as our kick-off point to apply a gradient descent optimization
scheme,

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \mu\sum_{k=1}^{K}\left(\boldsymbol{p}_k - \Sigma_{x_k}\boldsymbol{\theta}^{(i-1)}\right), \tag{5.82}$$

from which a corresponding stochastic gradient scheme results, by replacing expectations with
instantaneous observations and associating iteration steps with time updates, so that

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \mu\sum_{k=1}^{K}\boldsymbol{x}_k(n)e_k(n),$$

$$e_k(n) = y_k(n) - \boldsymbol{\theta}_{n-1}^T\boldsymbol{x}_k(n),$$

and a constant step size has been used, adopting the rationale behind the classical LMS formulation.
Such an LMS-type recursion is perfect for a centralized scenario, where all data are transmitted to a
fusion center. This is one of the extremes, having at its opposite end the scenario with nodes acting
individually without cooperation. However, there is an intermediate path, which will lead us to the
distributed diffusion mode of operation.
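
The centralized LMS-type recursion above can be sketched in a few lines. This is only an illustration under
assumed conditions: the network size, the dimension, the step size, the noise level, and the white Gaussian
inputs are all hypothetical choices, and the fusion center is simulated by stacking the $K$ input vectors
received at time $n$ as the rows of a matrix `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
K, l, mu, n_iter = 5, 4, 0.01, 3000        # hypothetical sizes and step size
theta_o = rng.standard_normal(l)           # common unknown parameter vector

theta = np.zeros(l)
for n in range(n_iter):
    X = rng.standard_normal((K, l))                  # row k holds x_k(n)
    y = X @ theta_o + 0.1 * rng.standard_normal(K)   # y_k(n) from model (5.79)
    e = y - X @ theta                                # e_k(n), computed with theta_{n-1}
    theta = theta + mu * X.T @ e                     # adds mu * sum_k x_k(n) e_k(n)

err = np.linalg.norm(theta - theta_o)
print(f"final parameter error norm: {err:.4f}")
```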

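Instead of trying to minimize (5.81), let us select a specific node $k$ and construct a local cost as the
weighted aggregate in $\mathcal{N}_k$, represented by

$$J_k^{\text{loc}}(\boldsymbol{\theta}) = \sum_{m\in\mathcal{N}_k} c_{mk}J_m(\boldsymbol{\theta}), \quad k = 1, 2, \ldots, K, \tag{5.83}$$

so that

$$\sum_{k=1}^{K} c_{mk} = 1, \quad c_{mk} \ge 0, \quad \text{and } c_{mk} = 0 \text{ if } m \notin \mathcal{N}_k, \quad m = 1, 2, \ldots, K. \tag{5.84}$$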

Let $C$ be the $K \times K$ matrix with entries $[C]_{mk} = c_{mk}$. Then the summation condition in (5.84) can
be written as

$$C\boldsymbol{1} = \boldsymbol{1}, \tag{5.85}$$

where $\boldsymbol{1}$ is the vector with all its entries being equal to 1. That is, all the entries across a row
sum to 1. Such matrices are known as right stochastic matrices. In contrast, a matrix is said to be
left stochastic if

$$C^T\boldsymbol{1} = \boldsymbol{1}.$$

Also, a matrix that is both left and right stochastic is known as doubly stochastic (Problem 5.16). Note
that due to this matrix constraint, we still have

$$\sum_{k=1}^{K} J_k^{\text{loc}}(\boldsymbol{\theta}) = \sum_{k=1}^{K}\sum_{m\in\mathcal{N}_k} c_{mk}J_m(\boldsymbol{\theta}) = \sum_{k=1}^{K}\sum_{m=1}^{K} c_{mk}J_m(\boldsymbol{\theta}) = \sum_{m=1}^{K} J_m(\boldsymbol{\theta}) = J(\boldsymbol{\theta}).$$

That is, summing all local costs, the global one results. 
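
This identity is easy to verify numerically. The following minimal sketch assumes a hypothetical four-node
path graph and uses random numbers as stand-ins for the values $J_m(\boldsymbol{\theta})$ at some fixed $\boldsymbol{\theta}$; it builds a matrix $C$
that is zero outside the neighborhoods, normalizes its rows so that $C\boldsymbol{1} = \boldsymbol{1}$, and checks that the local costs
add up to the global one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical path graph 0-1-2-3; adj[m, k] = 1 if node m belongs to N_k (self included).
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)

C = adj * rng.random(adj.shape)           # nonnegative, zero outside the neighborhoods
C = C / C.sum(axis=1, keepdims=True)      # rows sum to one: sum_k c_mk = 1, i.e., C 1 = 1

J_m = rng.random(4)                       # stand-ins for J_m(theta), m = 1..K
J_loc = C.T @ J_m                         # J_k^loc = sum_m c_mk J_m

print(J_loc.sum(), J_m.sum())             # the two sums agree
```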

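Let us focus on minimizing (5.83). The gradient descent scheme results in

$$\boldsymbol{\theta}_k^{(i)} = \boldsymbol{\theta}_k^{(i-1)} + \mu_k\sum_{m\in\mathcal{N}_k} c_{mk}\left(\boldsymbol{p}_m - \Sigma_{x_m}\boldsymbol{\theta}_k^{(i-1)}\right).$$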

However, since nodes in the neighborhood exchange information, they could also share their current
estimates. This is justified by the fact that the ultimate goal is to reach a common estimate; thus,
exchanging current estimates could be used for the benefit of the algorithmic process to achieve this
goal. To this end, we will modify the cost in (5.83) by regularizing it, leading to

$$\tilde{J}_k^{\text{loc}}(\boldsymbol{\theta}) = \sum_{m\in\mathcal{N}_k} c_{mk}J_m(\boldsymbol{\theta}) + \lambda\left\|\boldsymbol{\theta} - \tilde{\boldsymbol{\theta}}\right\|^2, \tag{5.86}$$

where $\tilde{\boldsymbol{\theta}}$ encodes information with respect to the unknown vector, which is obtained by the neighboring
nodes and $\lambda > 0$. Applying the gradient descent on (5.86) (and absorbing the factor “2,” which comes
from the exponents, into the step size), we obtain

$$\boldsymbol{\theta}_k^{(i)} = \boldsymbol{\theta}_k^{(i-1)} + \mu_k\sum_{m\in\mathcal{N}_k} c_{mk}\left(\boldsymbol{p}_m - \Sigma_{x_m}\boldsymbol{\theta}_k^{(i-1)}\right) + \mu_k\lambda\left(\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}_k^{(i-1)}\right), \tag{5.87}$$

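which can be broken down into the following two steps:

$$\text{Step 1:}\quad \boldsymbol{\psi}_k^{(i)} = \boldsymbol{\theta}_k^{(i-1)} + \mu_k\sum_{m\in\mathcal{N}_k} c_{mk}\left(\boldsymbol{p}_m - \Sigma_{x_m}\boldsymbol{\theta}_k^{(i-1)}\right),$$

$$\text{Step 2:}\quad \boldsymbol{\theta}_k^{(i)} = \boldsymbol{\psi}_k^{(i)} + \mu_k\lambda\left(\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}_k^{(i-1)}\right).$$

Step 2 can be slightly modified by replacing $\boldsymbol{\theta}_k^{(i-1)}$ with $\boldsymbol{\psi}_k^{(i)}$, since this encodes more recent information,
and we obtain

$$\boldsymbol{\theta}_k^{(i)} = \boldsymbol{\psi}_k^{(i)} + \mu_k\lambda\left(\tilde{\boldsymbol{\theta}} - \boldsymbol{\psi}_k^{(i)}\right).$$

Furthermore, a reasonable choice of $\tilde{\boldsymbol{\theta}}$, at each iteration step, would be

$$\tilde{\boldsymbol{\theta}} = \tilde{\boldsymbol{\theta}}^{(i)} := \sum_{m\in\mathcal{N}_k\setminus k} b_{mk}\boldsymbol{\psi}_m^{(i)},$$

where

$$\sum_{m\in\mathcal{N}_k\setminus k} b_{mk} = 1, \quad b_{mk} \ge 0,$$

and $\mathcal{N}_k\setminus k$ denotes the elements in $\mathcal{N}_k$ excluding $k$. In other words, at each iteration, we update $\boldsymbol{\theta}_k$ so
as to move it toward the descent direction of the local cost and at the same time we constrain it to stay
close to the convex combination of the rest of the updates, which are obtained during the computations
in step 1 from all the nodes in its neighborhood. Thus, we end up with the following recursions.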
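Diffusion gradient descent

$$\text{Step 1:}\quad \boldsymbol{\psi}_k^{(i)} = \boldsymbol{\theta}_k^{(i-1)} + \mu_k\sum_{m\in\mathcal{N}_k} c_{mk}\left(\boldsymbol{p}_m - \Sigma_{x_m}\boldsymbol{\theta}_k^{(i-1)}\right), \tag{5.91}$$

$$\text{Step 2:}\quad \boldsymbol{\theta}_k^{(i)} = \sum_{m\in\mathcal{N}_k} a_{mk}\boldsymbol{\psi}_m^{(i)}, \tag{5.92}$$

where we set

$$a_{kk} = 1 - \mu_k\lambda \quad \text{and} \quad a_{mk} = \mu_k\lambda b_{mk}, \tag{5.93}$$

which leads to

$$\sum_{m\in\mathcal{N}_k} a_{mk} = 1, \quad a_{mk} \ge 0, \tag{5.94}$$

for small enough values of $\mu_k\lambda$. Note that by setting $a_{mk} = 0$, $m \notin \mathcal{N}_k$ and defining $A$ to be the matrix
with entries $[A]_{mk} = a_{mk}$, we can write

$$\sum_{m=1}^{K} a_{mk} = 1 \;\Rightarrow\; A^T\boldsymbol{1} = \boldsymbol{1}, \tag{5.95}$$

that is, $A$ is a left stochastic matrix. It is important to stress here that, irrespective of our derivation
before, any left stochastic matrix $A$ can be used in (5.92).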
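
As a quick check of (5.94), a sketch that uses only the choices in (5.93) and the normalization of the
weights $b_{mk}$, the combination coefficients indeed sum to one:

$$\sum_{m\in\mathcal{N}_k} a_{mk} = a_{kk} + \sum_{m\in\mathcal{N}_k\setminus k} a_{mk}
= (1 - \mu_k\lambda) + \mu_k\lambda\sum_{m\in\mathcal{N}_k\setminus k} b_{mk} = 1,$$

while nonnegativity of $a_{kk}$ requires $\mu_k\lambda \le 1$, which is one way to read the requirement of "small enough"
values of $\mu_k\lambda$.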

A slightly different path to arrive at (5.87) is via the interpretation of the gradient descent scheme as
a minimizer of a regularized linearization of the cost function around the currently available estimate.
The regularizer used is $\|\boldsymbol{\theta} - \boldsymbol{\theta}^{(i-1)}\|^2$ and it tries to keep the new update as close as possible to the
currently available estimate. In the context of the distributed learning, instead of $\boldsymbol{\theta}^{(i-1)}$ we can use a
convex combination of the available estimates obtained in the neighborhood [26,27,84].

We are now ready to state the first version of the diffusion LMS (DiLMS), by replacing in (5.91)
and (5.92) expectations with instantaneous observations and interpreting iterations as time updates.

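Algorithm 5.7 (The adapt-then-combine diffusion LMS).

• Initialize

  - For $k = 1, 2, \ldots, K$, Do

    * $\boldsymbol{\theta}_k(-1) = \boldsymbol{0} \in \mathbb{R}^l$; or any other value.

  - End For

  - Select $\mu_k$, $k = 1, 2, \ldots, K$; a small positive number.

  - Select $C$: $C\boldsymbol{1} = \boldsymbol{1}$

  - Select $A$: $A^T\boldsymbol{1} = \boldsymbol{1}$

• For $n = 0, 1, \ldots$, Do

  - For $k = 1, 2, \ldots, K$, Do

    * For $m \in \mathcal{N}_k$, Do

      • $e_{k,m}(n) = y_m(n) - \boldsymbol{\theta}_k^T(n-1)\boldsymbol{x}_m(n)$; For complex-valued data, change $T \rightarrow H$.

    * End For

    * $\boldsymbol{\psi}_k(n) = \boldsymbol{\theta}_k(n-1) + \mu_k\sum_{m\in\mathcal{N}_k} c_{mk}\boldsymbol{x}_m(n)e_{k,m}(n)$; For complex-valued data, $e_{k,m}(n) \rightarrow e_{k,m}^*(n)$.

  - End For

  - For $k = 1, 2, \ldots, K$

    * $\boldsymbol{\theta}_k(n) = \sum_{m\in\mathcal{N}_k} a_{mk}\boldsymbol{\psi}_m(n)$

  - End For

• End For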
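
The steps of the adapt-then-combine recursion can be sketched in vectorized form for real-valued data. This
is an illustration under stated assumptions, not a reference implementation: the function name
`atc_diffusion_lms`, the array layout, and the test network (a ring of four nodes with uniform weights, white
Gaussian inputs, a noise standard deviation of 0.1, and a step size of 0.05) are all hypothetical choices.

```python
import numpy as np

def atc_diffusion_lms(X, Y, C, A, mu):
    """Adapt-then-combine diffusion LMS (real-valued data).

    X: (n_iter, K, l) array, X[n, k] = x_k(n);  Y: (n_iter, K) array, Y[n, k] = y_k(n).
    C, A: (K, K) arrays with entries c_mk and a_mk (zero outside the neighborhoods).
    mu: (K,) array of step sizes. Returns a (K, l) array whose row k is theta_k.
    """
    n_iter, K, l = X.shape
    theta = np.zeros((K, l))
    for n in range(n_iter):
        # E[k, m] = e_{k,m}(n) = y_m(n) - theta_k(n-1)^T x_m(n)
        E = Y[n][None, :] - theta @ X[n].T
        # Adaptation: psi_k = theta_k + mu_k * sum_m c_mk x_m(n) e_{k,m}(n)
        psi = theta + mu[:, None] * ((C.T * E) @ X[n])
        # Combination: theta_k = sum_m a_mk psi_m
        theta = A.T @ psi
    return theta

# Hypothetical test network: ring of K = 4 nodes, self-loops included.
rng = np.random.default_rng(0)
K, l, n_iter = 4, 5, 2000
adj = np.eye(K)
for k in range(K):
    adj[k, (k + 1) % K] = adj[(k + 1) % K, k] = 1.0
A = adj / adj.sum(axis=0, keepdims=True)   # columns sum to one: A^T 1 = 1
C = adj / adj.sum(axis=1, keepdims=True)   # rows sum to one:    C 1 = 1

theta_o = rng.standard_normal(l)
X = rng.standard_normal((n_iter, K, l))
Y = X @ theta_o + 0.1 * rng.standard_normal((n_iter, K))
theta = atc_diffusion_lms(X, Y, C, A, mu=np.full(K, 0.05))
max_err = np.max(np.linalg.norm(theta - theta_o, axis=1))
print(f"largest per-node error norm: {max_err:.4f}")
```

Storing the errors $e_{k,m}(n)$ in a $K \times K$ array is convenient for a simulation; in an actual network each
node would only compute the entries corresponding to its own neighborhood.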

The following comments are in order: 

• This form of DiLMS is known as adapt-then-combine (ATC) DiLMS since the first step refers to
the update and the combination step follows.







FIGURE 5.25 

Adapt-then-combine. (A) In step 1, adaptation is carried out after the exchange of the received observations. (B) In 
step 2, the nodes exchange their locally computed estimates to obtain the updated one. 


• In the special case of $C = I$, the adaptation step becomes

$$\boldsymbol{\psi}_k(n) = \boldsymbol{\theta}_k(n-1) + \mu_k\boldsymbol{x}_k(n)e_k(n),$$

and nodes need not exchange their observations.

• The adaptation rationale is illustrated in Fig. 5.25. At time $n$, all three neighbors exchange the
received data. In case the input vector corresponds to a realization of a random signal, $u_k(n)$, the
exchange of information comprises two values $(y_k(n), u_k(n))$ in each direction for each one of the
links. In the more general case, where the input is a random vector of jointly distributed variables,
all $l$ variables have to be exchanged. After this message passing, adaptation takes place, as shown in
Fig. 5.25A. Then, the nodes exchange their obtained estimates, $\boldsymbol{\psi}_k(n)$, $k = 1, 2, 3$, across the links
(Fig. 5.25B).

A different scheme results if one reverses the order of the two steps and performs first the combination
and then the adaptation.

Algorithm 5.8 (The combine-then-adapt diffusion LMS).

• Initialization

  - For $k = 1, 2, \ldots, K$, Do

    * $\boldsymbol{\theta}_k(-1) = \boldsymbol{0} \in \mathbb{R}^l$; or any other value.

  - End For

  - Select $C$: $C\boldsymbol{1} = \boldsymbol{1}$

  - Select $A$: $A^T\boldsymbol{1} = \boldsymbol{1}$

  - Select $\mu_k$, $k = 1, 2, \ldots, K$; a small value.

• For $n = 0, 1, 2, \ldots$, Do

  - For $k = 1, 2, \ldots, K$, Do

    * $\boldsymbol{\psi}_k(n-1) = \sum_{m\in\mathcal{N}_k} a_{mk}\boldsymbol{\theta}_m(n-1)$

  - End For

  - For $k = 1, 2, \ldots, K$, Do

    * For $m \in \mathcal{N}_k$, Do

      • $e_{k,m}(n) = y_m(n) - \boldsymbol{\psi}_k^T(n-1)\boldsymbol{x}_m(n)$; For complex-valued data, change $T \rightarrow H$.

    * End For

    * $\boldsymbol{\theta}_k(n) = \boldsymbol{\psi}_k(n-1) + \mu_k\sum_{m\in\mathcal{N}_k} c_{mk}\boldsymbol{x}_m(n)e_{k,m}(n)$; For complex-valued data, $e_{k,m}(n) \rightarrow e_{k,m}^*(n)$.

  - End For

• End For

The rationale of this adaptation scheme is the reverse of that illustrated in Fig. 5.25, where the phase
in 5.25B precedes that of 5.25A. In the case $C = I$, there is no input-output data information exchange
and the parameter update for node $k$ becomes

$$\boldsymbol{\theta}_k(n) = \boldsymbol{\psi}_k(n-1) + \mu_k\boldsymbol{x}_k(n)e_k(n).$$
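
The combine-then-adapt recursion admits an equally short sketch for real-valued data. As before, the
function name `cta_diffusion_lms`, the array layout, and the test setup are hypothetical; the example uses
$C = I$, so that each node adapts with its own data only, together with a fully connected network of three
nodes and uniform combination weights.

```python
import numpy as np

def cta_diffusion_lms(X, Y, C, A, mu):
    """Combine-then-adapt diffusion LMS (real-valued data).

    X: (n_iter, K, l) array, X[n, k] = x_k(n);  Y: (n_iter, K) array, Y[n, k] = y_k(n).
    C, A: (K, K) arrays with entries c_mk and a_mk. mu: (K,) array of step sizes.
    Returns a (K, l) array whose row k is theta_k after the last iteration.
    """
    n_iter, K, l = X.shape
    theta = np.zeros((K, l))
    for n in range(n_iter):
        # Combination: psi_k(n-1) = sum_m a_mk theta_m(n-1)
        psi = A.T @ theta
        # E[k, m] = e_{k,m}(n) = y_m(n) - psi_k(n-1)^T x_m(n)
        E = Y[n][None, :] - psi @ X[n].T
        # Adaptation: theta_k = psi_k + mu_k * sum_m c_mk x_m(n) e_{k,m}(n)
        theta = psi + mu[:, None] * ((C.T * E) @ X[n])
    return theta

# Hypothetical setup: K = 3 fully connected nodes, uniform A, and C = I.
rng = np.random.default_rng(0)
K, l, n_iter = 3, 4, 2000
A = np.full((K, K), 1.0 / K)
C = np.eye(K)

theta_o = rng.standard_normal(l)
X = rng.standard_normal((n_iter, K, l))
Y = X @ theta_o + 0.1 * rng.standard_normal((n_iter, K))
theta = cta_diffusion_lms(X, Y, C, A, mu=np.full(K, 0.05))
max_err = np.max(np.linalg.norm(theta - theta_o, axis=1))
print(f"largest per-node error norm: {max_err:.4f}")
```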


Remarks 5.4. 

• One of the early reports on the DiLMS can be found in [59]. In [80,93], versions of the algorithm 
for diminishing step sizes are presented and its convergence properties are analyzed. Besides the 
DiLMS, a version for incremental distributed cooperation has been proposed in [58]. For a related 
review, see [84,86,87]. 

• So far, nothing has been said about the choice of the matrices $C$ ($A$). There are a number of
possibilities. Two popular choices are the following.

Averaging rule:

$$c_{mk} = \begin{cases} \dfrac{1}{n_k}, & \text{if } k = m \text{ or if nodes } k \text{ and } m \text{ are neighbors}, \\ 0, & \text{otherwise}, \end{cases}$$

and the respective matrix is left stochastic.

Metropolis rule:

$$c_{mk} = \begin{cases} \dfrac{1}{\max\{n_k, n_m\}}, & \text{if } k \ne m \text{ and } k, m \text{ are neighbors}, \\ 1 - \sum_{i\in\mathcal{N}_k\setminus k} c_{ik}, & m = k, \\ 0, & \text{otherwise}, \end{cases}$$

which makes the respective matrix doubly stochastic.

• Distributed LMS-based algorithms for the case where different nodes estimate different, yet over- 
lapping, parameter vectors have also been derived [20,75]. 
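
The averaging and Metropolis rules mentioned in the remarks above can be sketched as follows. The helper
names and the four-node graph are hypothetical; the graph was chosen with unequal degrees so that the
difference between a left stochastic and a doubly stochastic matrix becomes visible. It is assumed that the
averaging rule assigns the weight $1/n_k$ to every member of $\mathcal{N}_k$.

```python
import numpy as np

def averaging_rule(adj):
    """c_mk = 1/n_k for m in N_k, 0 otherwise (adj[m, k] = 1 if m in N_k, diagonal included)."""
    n = adj.sum(axis=0)                    # n_k = |N_k|
    return adj / n[None, :]                # divide column k by n_k

def metropolis_rule(adj):
    """c_mk = 1/max(n_k, n_m) for neighbors m != k; the diagonal completes each column to one."""
    n = adj.sum(axis=0)
    K = adj.shape[0]
    C = np.zeros((K, K))
    for k in range(K):
        for m in range(K):
            if m != k and adj[m, k]:
                C[m, k] = 1.0 / max(n[k], n[m])
    np.fill_diagonal(C, 1.0 - C.sum(axis=0))   # c_kk = 1 - sum of the off-diagonal c_ik
    return C

# Hypothetical graph with edges (0,1), (1,2), (1,3), (2,3); self-loops on the diagonal.
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 1],
                [0, 1, 1, 1],
                [0, 1, 1, 1]], dtype=float)

C_avg = averaging_rule(adj)
C_met = metropolis_rule(adj)
print("averaging, column sums:", C_avg.sum(axis=0), "row sums:", C_avg.sum(axis=1))
print("Metropolis, column sums:", C_met.sum(axis=0), "row sums:", C_met.sum(axis=1))
```

For this graph the averaging rule yields unit column sums but not unit row sums, whereas the Metropolis
matrix is symmetric and therefore has unit row and column sums.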

5.13.3 CONVERGENCE AND STEADY-STATE PERFORMANCE: SOME HIGHLIGHTS 

In this subsection, we will summarize some findings concerning the performance analysis of the
DiLMS. We will not give proofs. The proofs follow similar steps as for the standard LMS, with slightly
more involved algebra. The interested reader can obtain proofs by looking at the original papers as well
as in [84].

• The gradient descent scheme in (5.91), (5.92) is guaranteed to converge, meaning



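$$\boldsymbol{\theta}_k^{(i)} \longrightarrow \boldsymbol{\theta}_* \quad \text{as } i \rightarrow \infty,$$

provided that

$$\mu_k < \frac{2}{\lambda_{\max}\left\{\Sigma_k^{\text{loc}}\right\}},$$

where

$$\Sigma_k^{\text{loc}} := \sum_{m\in\mathcal{N}_k} c_{mk}\Sigma_{x_m}. \tag{5.96}$$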


This corresponds to the condition in (5.16). 

• If one assumes that $C$ is doubly stochastic, it can be shown that the convergence rate to the solution
for the distributed case is higher than that corresponding to the noncooperative scenario, when each
node operates individually, using the same step size, $\mu_k = \mu$, for all cases and provided this common
value guarantees convergence. In other words, cooperation improves the convergence speed. This is
in line with the general comments made in the beginning of the section.

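• Assume that in the model in (5.79), the involved noise sequences are both spatially and temporally
white, as represented by

$$\mathbb{E}\left[\eta_k(n)\eta_k(n-r)\right] = \sigma_k^2\delta_r, \quad \delta_r = \begin{cases} 1, & r = 0, \\ 0, & r \ne 0, \end{cases}$$

$$\mathbb{E}\left[\eta_k(n)\eta_m(r)\right] = \sigma_k^2\delta_{km}\delta_{nr}, \quad \delta_{km}\delta_{nr} = \begin{cases} 1, & k = m,\ n = r, \\ 0, & \text{otherwise}. \end{cases}$$

Also, the noise sequences are independent of the input vectors,

$$\mathbb{E}\left[\boldsymbol{x}_m(n)\eta_k(n-r)\right] = \boldsymbol{0}, \quad k, m = 1, 2, \ldots, K, \ \forall r,$$

and finally, the independence assumption is mobilized among the input vectors, spatially as well as
temporally, namely,

$$\mathbb{E}\left[\boldsymbol{x}_k(n)\boldsymbol{x}_m^T(n-r)\right] = O, \quad \text{if } k \ne m, \text{ and } \forall r.$$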


Under the previous assumptions, which correspond to the assumptions adopted when studying the 
performance of the LMS, the following hold true for the DiLMS. 

Convergence in the mean: Provided that 


$$\mu_k < \frac{2}{\lambda_{\max}\left\{\sum_{m\in\mathcal{N}_k} c_{mk}\Sigma_{x_m}\right\}}, \tag{5.97}$$


we have 


$$\mathbb{E}\left[\boldsymbol{\theta}_k(n)\right] \longrightarrow \boldsymbol{\theta}_* \quad \text{as } n \rightarrow \infty, \quad k = 1, 2, \ldots, K.$$

It is important to state here that the stability condition in (5.97) depends on $C$ and not on $A$.

• If in addition to the previous assumption, $C$ is chosen to be doubly stochastic, then the convergence
in the mean, in any node under the distributed scenario, is faster than that obtained if the node is
operating individually without cooperation, provided $\mu_k = \mu$ is the same and it is chosen so as to
guarantee convergence.

• Misadjustment: under the assumptions of $C$ and $A$ being doubly stochastic, the following are true:

- The average misadjustment over all nodes in the steady state for the ATC strategy is always
smaller than that of the combine-then-adapt one.

- The average misadjustment over all the nodes of the network in the distributed operation is
always lower than that obtained if nodes are adapted individually, without cooperation, by using
the same $\mu_k = \mu$ in all cases. That is, cooperation does not only improve convergence speed but
it also improves the steady-state performance.



FIGURE 5.26 

Average (over all the nodes) error convergence curves (MSD) for the LMS in noncooperative mode of operation
(dotted line) and for the case of the DiLMS in the ATC mode (red line) and the CTA mode (gray line). The step size
$\mu$ was the same in all three cases. Cooperation among nodes significantly improves performance. For the case of the
DiLMS, the ATC version results in slightly better performance compared to that of the CTA (MSD in dBs).


Example 5.8. In this example, a network of $K = 10$ nodes is considered. The nodes were randomly
connected with a total number of 32 connections; it was checked that the resulting network was strongly
connected. In each node, data are generated according to a regression model, using the same vector
$\boldsymbol{\theta}_o \in \mathbb{R}^{30}$. The latter was randomly generated by $\mathcal{N}(0, 1)$. The input vectors, $\boldsymbol{x}_k$ in (5.79), were i.i.d.
generated according to $\mathcal{N}(0, 1)$ and the noise level was different for each node, varying from 20 to
25 dBs.

Three experiments were carried out. The first involved the distributed LMS in its adapt-then-combine
(ATC) form and the second one the combine-then-adapt (CTA) version. In the third experiment,
the LMS algorithm was run independently for each node, without cooperation. In all
cases, the step size was chosen equal to $\mu = 0.01$. Fig. 5.26 shows the average (over all nodes)
$\text{MSD}(n) := \frac{1}{K}\sum_{k=1}^{K}\|\boldsymbol{\theta}_k(n) - \boldsymbol{\theta}_o\|^2$ obtained for each one of the experiments. It is readily seen that
cooperation improves the performance significantly, both in terms of convergence as well as in steady-state
error floor. Moreover, as stated in Section 5.13.3, the ATC performs slightly better than the CTA
version.
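
An experiment in the spirit of Example 5.8 can be sketched as follows. This is not the code behind
Fig. 5.26, and several details are assumptions made only for illustration: the topology (a ring plus a few
random extra links, which guarantees connectivity), the use of Metropolis weights for $A$, the choice $C = I$,
the interpretation of the noise level as a per-node signal-to-noise ratio between 20 and 25 dB, the number
of iterations, and the averaging of the MSD over the last 500 iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
K, l, mu, n_iter = 10, 30, 0.01, 2000
theta_o = rng.standard_normal(l)

# Hypothetical topology: a ring (guarantees connectivity) plus random extra links.
adj = np.eye(K)
for k in range(K):
    adj[k, (k + 1) % K] = adj[(k + 1) % K, k] = 1.0
for a, b in rng.integers(0, K, size=(15, 2)):
    adj[a, b] = adj[b, a] = 1.0

# Metropolis combination weights (doubly stochastic); C = I is assumed throughout.
n_k = adj.sum(axis=0)
A = np.where(adj > 0, 1.0 / np.maximum.outer(n_k, n_k), 0.0)
np.fill_diagonal(A, 0.0)
np.fill_diagonal(A, 1.0 - A.sum(axis=0))

# Assumed per-node SNR between 20 and 25 dB; signal power is ||theta_o||^2 for white inputs.
snr_db = rng.uniform(20, 25, size=K)
noise_std = np.sqrt(theta_o @ theta_o / 10 ** (snr_db / 10))

th_non = np.zeros((K, l))   # noncooperative LMS
th_atc = np.zeros((K, l))   # adapt-then-combine
th_cta = np.zeros((K, l))   # combine-then-adapt
msd = np.zeros((n_iter, 3))
for n in range(n_iter):
    X = rng.standard_normal((K, l))                       # row k holds x_k(n)
    y = X @ theta_o + noise_std * rng.standard_normal(K)
    # Noncooperative: each node runs its own LMS.
    e = y - np.sum(th_non * X, axis=1)
    th_non = th_non + mu * e[:, None] * X
    # ATC: local LMS step followed by combination.
    e = y - np.sum(th_atc * X, axis=1)
    th_atc = A.T @ (th_atc + mu * e[:, None] * X)
    # CTA: combination followed by local LMS step.
    psi = A.T @ th_cta
    e = y - np.sum(psi * X, axis=1)
    th_cta = psi + mu * e[:, None] * X
    for j, th in enumerate((th_non, th_atc, th_cta)):
        msd[n, j] = np.mean(np.sum((th - theta_o) ** 2, axis=1))

msd_db = 10 * np.log10(msd[-500:].mean(axis=0))           # steady-state MSD in dB
print("noncooperative / ATC / CTA steady-state MSD (dB):", np.round(msd_db, 1))
```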

5.13.4 CONSENSUS-BASED DISTRIBUTED SCHEMES 

An alternative path for deriving an LMS version for distributed networks was followed in [64,88].
Recall that, so far, in our discussion in deriving the DiLMS, we required the update at each node to
be close to a convex combination of the available estimates in the respective neighborhood. Now we
will demand such a requirement to become very strict. Although we are not going to get involved with
details, since this would require diverting quite a lot from the material and the algorithmic tools which
have been presented so far, let us state the task in the context of the linear MSE estimation.

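To bring (5.81) in a distributed learning context, let us modify it by allowing different parameter
vectors for each node $k$, so that

$$J(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K) = \sum_{k=1}^{K}\mathbb{E}\left[\left|y_k - \boldsymbol{\theta}_k^T\boldsymbol{x}_k\right|^2\right].$$

Then the task is cast according to the following constrained optimization problem:

$$\begin{aligned}
\{\hat{\boldsymbol{\theta}}_k,\ k = 1, \ldots, K\} = {} & \arg\min_{\{\boldsymbol{\theta}_k,\ k = 1, \ldots, K\}} J(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K) \\
& \text{s.t.} \quad \boldsymbol{\theta}_k = \boldsymbol{\theta}_m, \quad k = 1, 2, \ldots, K, \ m \in \mathcal{N}_k.
\end{aligned}$$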


In other words, one demands equality of the estimates within a neighborhood. As a consequence, these 
constraints lead to network-wise equality, since the graph that represents the network has been assumed 
to be connected. The optimization is carried out iteratively by employing stochastic approximation 
arguments and building on the alternating direction method of multipliers (ADMM) (Chapter 8) [19]. 
The algorithm, besides updating the vector estimates, has to update the associated Lagrange multipliers 
as well. 

In addition to the previously reported ADMM-based scheme, a number of variants known as 
consensus-based algorithms have been employed in several studies [19,33,49,50,71]. A formulation 
around which a number of stochastic gradient consensus-based algorithms evolve is the following [33, 
49,50]: 


$$\boldsymbol{\theta}_k(n) = \boldsymbol{\theta}_k(n-1) + \mu_k(n)\Big[\boldsymbol{x}_k(n) e_k(n) + \lambda \sum_{m \in \mathcal{N}_k \setminus k} \big(\boldsymbol{\theta}_m(n-1) - \boldsymbol{\theta}_k(n-1)\big)\Big], \tag{5.98}$$

where

$$e_k(n) := y_k(n) - \boldsymbol{\theta}_k^T(n-1)\boldsymbol{x}_k(n)$$

and for some $\lambda > 0$. Observe the form in (5.98); the summation term inside the brackets on the
right-hand side is a regularizer whose goal is to enforce equality among the estimates within the
neighborhood of node $k$. Several alternatives to Eq. (5.98) have been proposed. For example, in [49] a
different step size is employed for the consensus summation on the right-hand side of (5.98). In [99],
the following formulation is provided:


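$$\boldsymbol{\theta}_k(n) = \boldsymbol{\theta}_k(n-1) + \mu_k(n)\Big[\boldsymbol{x}_k(n) e_k(n) + \sum_{m \in \mathcal{N}_k \setminus k} b_{m,k}\big(\boldsymbol{\theta}_m(n-1) - \boldsymbol{\theta}_k(n-1)\big)\Big], \tag{5.99}$$

where $b_{m,k}$ stands for some nonnegative coefficients. If one defines the weights

$$a_{m,k} := \begin{cases}
1 - \sum_{j \in \mathcal{N}_k \setminus k} \mu_k(n) b_{j,k}, & m = k, \\
\mu_k(n) b_{m,k}, & m \in \mathcal{N}_k \setminus k, \\
0, & \text{otherwise},
\end{cases} \tag{5.100}$$

recursion (5.99) can be equivalently written as

$$\boldsymbol{\theta}_k(n) = \sum_{m \in \mathcal{N}_k} a_{m,k}\,\boldsymbol{\theta}_m(n-1) + \mu_k(n)\boldsymbol{x}_k(n) e_k(n). \tag{5.101}$$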


The update rule in (5.101) is also referred to as the consensus strategy (see, e.g., [99]). Note that the
step size is considered to be time-varying. In particular, in [19,71], a diminishing step size is employed,
within the stochastic gradient rationale, which has to satisfy the familiar pair of conditions in order to
guarantee convergence to a consensus value over all the nodes,

$$\sum_{n=0}^{\infty} \mu_k(n) = \infty, \qquad \sum_{n=0}^{\infty} \mu_k^2(n) < \infty. \tag{5.102}$$

Observe the update recursion in (5.101). It is readily seen that the update of $\boldsymbol{\theta}_k(n)$ involves only
the error $e_k(n)$ of the corresponding node. In contrast, looking carefully at the corresponding update
recursions in both Algorithms 5.7 and 5.8, $\boldsymbol{\theta}_k(n)$ is updated according to the average error within the
neighborhood. This is an important difference.

The theoretical properties of the consensus recursion (5.101), which employs a constant step size, 
and a comparative analysis against the diffusion schemes have been presented in [86,99]. There, it has 
been shown that the diffusion schemes outperform the consensus-based ones, in the sense that (a) they 
converge faster, (b) they reach a lower steady-state mean-square deviation error floor, and (c) their 
mean-square stability is insensitive to the choice of the combination weights. 
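
The consensus recursion (5.101) can be sketched in a few lines of code. The following is only an illustrative sketch under stated assumptions, not the implementation used in the cited works: the function name `consensus_lms`, the ring topology, the uniform weights, the constant step size, and all numerical values are hypothetical choices made for the example.

```python
import numpy as np

def consensus_lms(X, y, A, mu):
    """Run the consensus recursion (5.101) with a constant step size.

    X  : array (K, N, l), regressors x_k(n) for node k at time n
    y  : array (K, N), observations y_k(n)
    A  : array (K, K), A[m, k] = a_{m,k}; column k holds the weights of node k
    mu : constant step size (an assumption; the text also allows mu_k(n))
    """
    K, N, l = X.shape
    theta = np.zeros((K, l))
    for n in range(N):
        prev = theta.copy()                      # estimates theta_m(n-1)
        for k in range(K):
            e = y[k, n] - prev[k] @ X[k, n]      # local error e_k(n) only
            combined = A[:, k] @ prev            # sum_m a_{m,k} theta_m(n-1)
            theta[k] = combined + mu * X[k, n] * e
    return theta

rng = np.random.default_rng(0)
K, N, l = 4, 3000, 3
theta_o = rng.standard_normal(l)                 # common unknown vector
X = rng.standard_normal((K, N, l))
y = X @ theta_o + 0.1 * rng.standard_normal((K, N))

# Hypothetical ring network: each node averages itself and its two neighbors.
A = np.zeros((K, K))
for k in range(K):
    for m in (k - 1, k, k + 1):
        A[m % K, k] = 1.0 / 3.0

theta = consensus_lms(X, y, A, mu=0.01)
max_err = np.max(np.linalg.norm(theta - theta_o, axis=1))
print("largest node error:", max_err)
```

Note that each node uses only its own error $e_k(n)$ in the correction term, which is the structural difference from the diffusion schemes pointed out above.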


5.14 A CASE STUDY: TARGET LOCALIZATION 

Consider a network consisting of K nodes, whose goal is to estimate and track the location of a specific 
target. The location of the unknown target, say, $\boldsymbol{\theta}_o$, is assumed to belong to the two-dimensional space.
The position of each node is denoted by $\boldsymbol{\theta}_k = [\theta_{k1}, \theta_{k2}]^T$, and the true distance between node $k$ and the
unknown target is equal to

$$r_k = \|\boldsymbol{\theta}_o - \boldsymbol{\theta}_k\|. \tag{5.103}$$

The vector whose direction points from node $k$ to the unknown source is given by

$$\boldsymbol{g}_k = \frac{\boldsymbol{\theta}_o - \boldsymbol{\theta}_k}{\|\boldsymbol{\theta}_o - \boldsymbol{\theta}_k\|}. \tag{5.104}$$

Obviously, the distance can be rewritten in terms of the direction vector as

$$r_k = \boldsymbol{g}_k^T(\boldsymbol{\theta}_o - \boldsymbol{\theta}_k). \tag{5.105}$$


It is reasonable to assume that each node $k$ “senses” the distance and the direction vectors via noisy
observations. For example, such noisy information can be inferred from the strength of the received
signal or other related information. Following a similar rationale as in [84,98], the noisy distance can
be modeled as

$$\hat{r}_k(n) = r_k + v_k(n), \tag{5.106}$$

where $n$ stands for the discrete-time instant and $v_k(n)$ for the additive noise term. The noise in the
direction vector is a consequence of two effects: (a) a deviation occurring along the direction
perpendicular to $\boldsymbol{g}_k$ and (b) a deviation that takes place along the direction parallel to $\boldsymbol{g}_k$. All in all, the
noisy direction vector (see Fig. 5.27) occurring at time instant $n$ can be written as

$$\hat{\boldsymbol{g}}_k(n) = \boldsymbol{g}_k + v_k^{\perp}(n)\,\boldsymbol{g}_k^{\perp} + v_k^{\parallel}(n)\,\boldsymbol{g}_k, \tag{5.107}$$

where $v_k^{\perp}(n)$ is the noise corrupting the unit norm perpendicular direction vector $\boldsymbol{g}_k^{\perp}$ and $v_k^{\parallel}(n)$ is the
noise occurring at the parallel direction vector. Taking into consideration the noisy terms, (5.106) is
written as

$$\hat{r}_k(n) = \hat{\boldsymbol{g}}_k^T(n)(\boldsymbol{\theta}_o - \boldsymbol{\theta}_k) + \eta_k(n), \tag{5.108}$$

where

$$\eta_k(n) = v_k(n) - v_k^{\perp}(n)\,\boldsymbol{g}_k^{\perp T}(\boldsymbol{\theta}_o - \boldsymbol{\theta}_k) - v_k^{\parallel}(n)\,\boldsymbol{g}_k^T(\boldsymbol{\theta}_o - \boldsymbol{\theta}_k). \tag{5.109}$$

Eq. (5.109) can be further simplified if one recalls that, by construction, $\boldsymbol{g}_k^{\perp T}(\boldsymbol{\theta}_o - \boldsymbol{\theta}_k) = 0$. Moreover,
typically, the contribution of $v_k(n)$ is assumed to be significantly larger than the contribution of $v_k^{\parallel}(n)$.
Hence, taking into consideration these two arguments, (5.109) can be simplified to

$$\eta_k(n) \approx v_k(n). \tag{5.110}$$

If one defines $y_k(n) := \hat{r}_k(n) + \hat{\boldsymbol{g}}_k^T(n)\boldsymbol{\theta}_k$ and combines (5.108) with (5.110), the following model
results:

$$y_k(n) \approx \boldsymbol{\theta}_o^T \hat{\boldsymbol{g}}_k(n) + v_k(n). \tag{5.111}$$

FIGURE 5.27

Illustration of a node, the target source, and the direction vectors.


Eq. (5.111) is a linear regression model. Using the available estimates, for each time instant, one has
access to $y_k(n)$, $\hat{\boldsymbol{g}}_k(n)$ and any form of distributed algorithm can be adopted in order to obtain a better
estimate of $\boldsymbol{\theta}_o$.
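
The measurement model (5.103)-(5.111) can be checked numerically. The sketch below is an illustration under stated assumptions: the target and node positions, the noise levels, and the use of a pooled (centralized) least-squares fit in place of a distributed algorithm are all hypothetical choices made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_o = np.array([3.0, 4.0])                       # unknown target (assumed value)
nodes = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 8.0], [6.0, 8.0]])
N = 500                                              # time instants per node

G_rows, y_vals, resid_ok = [], [], True
for theta_k in nodes:
    diff = theta_o - theta_k
    r_k = np.linalg.norm(diff)                       # (5.103)
    g_k = diff / r_k                                 # (5.104)
    g_perp = np.array([-g_k[1], g_k[0]])             # unit vector orthogonal to g_k
    v = 0.1 * rng.standard_normal(N)                 # distance noise v_k(n)
    v_perp = 0.05 * rng.standard_normal(N)           # perpendicular direction noise
    v_par = 0.01 * rng.standard_normal(N)            # parallel direction noise
    r_hat = r_k + v                                  # (5.106)
    g_hat = g_k + np.outer(v_perp, g_perp) + np.outer(v_par, g_k)   # (5.107)
    y = r_hat + g_hat @ theta_k                      # y_k(n), defined before (5.111)
    eta = v - v_par * r_k                            # (5.109); the g_perp term vanishes
    resid_ok &= np.allclose(y - g_hat @ theta_o, eta)
    G_rows.append(g_hat)
    y_vals.append(y)

# Pooled least-squares fit over all nodes (a centralized stand-in, by assumption).
G = np.vstack(G_rows)
y_all = np.concatenate(y_vals)
theta_hat, *_ = np.linalg.lstsq(G, y_all, rcond=None)
err = np.linalg.norm(theta_hat - theta_o)
print("model identity holds:", resid_ok, " estimation error:", err)
```

A single node observes an almost constant direction vector, so its regressors carry little information about the component of the target position orthogonal to that direction; pooling nodes with different viewing directions is what makes the estimate well conditioned in this sketch.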

Indeed, it has been verified that the information exchange and fusion enhances significantly the 
ability of the nodes to estimate and track the target source. The nodes can possibly represent fish 
schools which seek a nutrition source, bee swarms which search for their hive, or bacteria seeking 
nutritive sources [28,84,85,97]. 

Some other typical applications of distributed learning are social networks [36], radio resource 
allocation [32], and network cartography [65]. 


5.15 SOME CONCLUDING REMARKS: CONSENSUS MATRIX 

In our treatment of the DiLMS, we used the combination matrices A (C), which we assumed to be left 
(right) stochastic. Also, in the performance-related section, we pointed out that some of the reported 
results hold true if these matrices are, in addition, doubly stochastic. Although it was not needed in this 
chapter, in the general distributed processing theory, a matrix of significant importance is the so-called 
consensus matrix. A matrix $A \in \mathbb{R}^{K \times K}$ is said to be a consensus matrix if, in addition to being doubly
stochastic, as represented by

$$A\boldsymbol{1} = \boldsymbol{1}, \qquad A^T\boldsymbol{1} = \boldsymbol{1},$$

it also satisfies the property

$$\left|\lambda_i\!\left(A^T - \frac{1}{K}\boldsymbol{1}\boldsymbol{1}^T\right)\right| < 1, \quad i = 1, 2, \ldots, K.$$

In other words, all eigenvalues of the matrix

$$A^T - \frac{1}{K}\boldsymbol{1}\boldsymbol{1}^T$$

have magnitude strictly less than one. To demonstrate its usefulness, we will state a fundamental
theorem in distributed learning.

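Theorem 5.2. Consider a network consisting of $K$ nodes, each one of them having access to a state
vector $\boldsymbol{x}_k$. Consider the recursion

$$\boldsymbol{\theta}_k^{(i)} = \sum_{m \in \mathcal{N}_k} a_{mk}\,\boldsymbol{\theta}_m^{(i-1)}, \quad k = 1, 2, \ldots, K, \quad i > 0: \ \text{consensus iteration},$$

with

$$\boldsymbol{\theta}_k^{(0)} = \boldsymbol{x}_k, \quad k = 1, 2, \ldots, K.$$

Define $A \in \mathbb{R}^{K \times K}$ to be the matrix with entries

$$[A]_{mk} = a_{mk}, \quad m, k = 1, 2, \ldots, K,$$

where $a_{mk} \ge 0$, $a_{mk} = 0$ if $m \notin \mathcal{N}_k$. If $A$ is a consensus matrix, then [31]

$$\boldsymbol{\theta}_k^{(i)} \longrightarrow \frac{1}{K}\sum_{m=1}^{K} \boldsymbol{x}_m \quad \text{as } i \to \infty.$$

The opposite is also true. If convergence is always guaranteed, then $A$ is a consensus matrix.

In other words, this theorem states that, when each node is updated by convexly combining, with
appropriate weights, the current estimates in its neighborhood, the network converges to the average
value in a consensus rationale (Problem 5.17).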
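
Theorem 5.2 is easy to probe numerically. The following sketch is only an illustration: the helper name `is_consensus_matrix`, the ring network, its weights, and the tolerance are hypothetical choices. It checks the two defining properties of a consensus matrix and then runs the consensus iteration until the node states agree with the network average.

```python
import numpy as np

def is_consensus_matrix(A, tol=1e-10):
    """Check double stochasticity and the spectral condition on A^T - (1/K) 1 1^T."""
    K = A.shape[0]
    ones = np.ones(K)
    doubly = (np.allclose(A @ ones, ones, atol=tol)
              and np.allclose(A.T @ ones, ones, atol=tol))
    eigs = np.linalg.eigvals(A.T - np.outer(ones, ones) / K)
    # A small tolerance guards against round-off for eigenvalues of magnitude one.
    return bool(doubly and np.max(np.abs(eigs)) < 1.0 - tol)

# Hypothetical ring of K = 5 nodes, with A[m, k] = a_{mk}.
K = 5
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = 0.5
    A[(k - 1) % K, k] = 0.25
    A[(k + 1) % K, k] = 0.25

rng = np.random.default_rng(2)
x = rng.standard_normal((K, 3))      # row k is the state vector x_k
theta = x.copy()                     # theta_k^(0) = x_k
for i in range(200):
    theta = A.T @ theta              # theta_k^(i) = sum_m a_{mk} theta_m^(i-1)

average = x.mean(axis=0)
gap = np.max(np.abs(theta - average))
print("consensus matrix:", is_consensus_matrix(A), " gap to average:", gap)
```

A doubly stochastic matrix need not be a consensus matrix: for the $2 \times 2$ permutation matrix that swaps two nodes, the matrix $A^T - \frac{1}{K}\boldsymbol{1}\boldsymbol{1}^T$ has an eigenvalue equal to $-1$, and the two nodes keep exchanging their states without ever averaging them.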


PROBLEMS 

5.1 Show that the gradient vector is perpendicular to the tangent at a point of an isovalue curve. 

5.2 Prove that if

$$\sum_{i=1}^{\infty} \mu_i^2 < \infty, \qquad \sum_{i=1}^{\infty} \mu_i = \infty,$$

the gradient descent scheme, for the MSE cost function and for the iteration dependent step size 
case, converges to the optimal solution. 

5.3 Derive the steepest gradient descent direction for the complex-valued case. 

5.4 Let $\theta$, $\mathrm{x}$ be two jointly distributed random variables. Let also the function (regression)

$$f(\theta) = \mathbb{E}[\mathrm{x}|\theta]$$

be assumed to be an increasing one. Show that under the conditions in (5.29), the recursion

$$\theta_n = \theta_{n-1} - \mu_n x_n$$

converges in probability to the root of $f(\theta)$.

5.5 Show that the LMS algorithm is a nonlinear estimator.

5.6 Show Eq. (5.42). 

5.7 Derive the bound in (5.45). 

Hint: Use the well-known property from linear algebra that the eigenvalues of a matrix $A \in \mathbb{R}^{l \times l}$
satisfy the following bound:

$$\max_{1 \le i \le l} |\lambda_i| \le \max_{1 \le i \le l} \sum_{j=1}^{l} |a_{ij}| := \|A\|_1.$$

5.8 Gershgorin circle theorem. Let $A$ be an $l \times l$ matrix, with entries $a_{ij}$, $i, j = 1, 2, \ldots, l$. Let
$R_i := \sum_{j=1, j \ne i}^{l} |a_{ij}|$ be the sum of absolute values of the nondiagonal entries in row $i$. Show that if $\lambda$ is
an eigenvalue of $A$, then there exists at least one row $i$ such that the following is true:

$$|\lambda - a_{ii}| \le R_i.$$

The last bound defines a circle which contains the eigenvalue $\lambda$.

5.9 Apply the Gershgorin circle theorem to prove the bound in (5.45). 

5.10 Derive the misadjustment formula given in (5.52). 

5.11 Derive the APA iteration scheme. 

5.12 Consider the hyperplane that comprises all the vectors $\boldsymbol{\theta}$ such that

$$\boldsymbol{x}_n^T\boldsymbol{\theta} - y_n = 0,$$

for a pair $(y_n, \boldsymbol{x}_n)$. Show that $\boldsymbol{x}_n$ is perpendicular to the hyperplane.

5.13 Derive the recursions for the widely linear APA. 

5.14 Show that a similarity transformation of a square matrix via a unitary matrix does not affect the 
eigenvalues. 

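5.15 Show that if $\mathbf{x} \in \mathbb{R}^l$ is a Gaussian random vector, then

$$F := \mathbb{E}[\mathbf{x}\mathbf{x}^T S \mathbf{x}\mathbf{x}^T] = \Sigma_x \operatorname{trace}\{S\Sigma_x\} + 2\Sigma_x S \Sigma_x,$$

and if $\mathbf{x} \in \mathbb{C}^l$,

$$F := \mathbb{E}[\mathbf{x}\mathbf{x}^H S \mathbf{x}\mathbf{x}^H] = \Sigma_x \operatorname{trace}\{S\Sigma_x\} + \Sigma_x S \Sigma_x.$$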

5.16 Show that if an $l \times l$ matrix $C$ is right stochastic, then all its eigenvalues satisfy

$$|\lambda_i| \le 1, \quad i = 1, 2, \ldots, l.$$

The same holds true for left and doubly stochastic matrices. 

5.17 Prove Theorem 5.2.


MATLAB® EXERCISES

5.18 Consider the MSE cost function in (5.4). Set the cross-correlation equal to $\boldsymbol{p} = [0.05, 0.03]^T$.
Also, consider two covariance matrices,


Compute the corresponding optimal solutions, $\boldsymbol{\theta}_{(*,1)} = \Sigma_1^{-1}\boldsymbol{p}$, $\boldsymbol{\theta}_{(*,2)} = \Sigma_2^{-1}\boldsymbol{p}$. Apply the gradient
descent scheme of (5.6) to estimate $\boldsymbol{\theta}_{(*,2)}$; set the step size equal to (a) its optimal value $\mu_o$
according to (5.17) and (b) equal to $\mu_o/2$. For these two choices for the step size, plot the error
$\|\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_{(*,2)}\|^2$ at each iteration step $i$. Compare the convergence speeds of these two curves
towards zero. Moreover, in the two-dimensional space, plot the coefficients of the successive
estimates, $\boldsymbol{\theta}^{(i)}$, for both step sizes, together with the isovalue contours of the cost function. What
do you observe regarding the trajectory towards the minimum?

Apply (5.6) for the estimation of $\boldsymbol{\theta}_{(*,1)}$ employing $\Sigma_1$ and $\boldsymbol{p}$. Use as step size $\mu_o$ of the
previous experiment. Plot, in the same figure, the previously computed error curve $\|\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_{(*,2)}\|^2$
together with the error curve $\|\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_{(*,1)}\|^2$. Compare the convergence speeds.

Now set the step size equal to the optimum value associated with $\Sigma_1$. Again, in the
two-dimensional space, plot the values of the successive estimates and the isovalue contours of the
cost function. Compare the number of steps needed for convergence with the ones needed in the
previous experiment. Play with different covariance matrices and step sizes.

5.19 Consider the linear regression model 

$$y_n = \boldsymbol{x}_n^T\boldsymbol{\theta}_o + \eta_n,$$

where $\boldsymbol{\theta}_o \in \mathbb{R}^2$. Generate the coefficients of the unknown vector $\boldsymbol{\theta}_o$ randomly according to the
normalized Gaussian distribution, $\mathcal{N}(0, 1)$. The noise is assumed to be white Gaussian with
variance 0.1. The samples of the input vector are i.i.d. generated via the normalized Gaussian.
Apply the Robbins-Monro algorithm in (5.34) for the optimal MSE linear estimation with a
step size equal to $\mu_n = 1/n$. Run 1000 independent experiments and plot the mean value of the
first coefficient of the 1000 produced estimates, at each iteration step. Also, plot the horizontal
line crossing the true value of the first coefficient of the unknown vector. Furthermore, plot the
standard deviation of the obtained estimate, every 30 iteration steps. Comment on the results.
Play with different rules of diminishing step sizes and comment on the results.

5.20 Generate data according to the regression model 

$$y_n = \boldsymbol{x}_n^T\boldsymbol{\theta}_o + \eta_n,$$

where $\boldsymbol{\theta}_o \in \mathbb{R}^{10}$, and whose elements are randomly obtained using the Gaussian distribution
$\mathcal{N}(0, 1)$. The noise samples are also i.i.d. generated from $\mathcal{N}(0, 0.01)$.

Generate the input samples as part of two processes: (a) a white noise sequence, i.i.d. generated
via $\mathcal{N}(0, 1)$, and (b) an autoregressive AR(1) process with $a_1 = 0.85$ and the corresponding
white noise excitation of variance equal to 1. For these two choices of the input, run the LMS
algorithm on the generated training set $(y_n, \boldsymbol{x}_n)$, $n = 0, 1, \ldots$, to estimate $\boldsymbol{\theta}_o$. Use a step size equal
to $\mu = 0.01$. Run 100 independent experiments and plot the average error per iteration in dBs,
using $10\log_{10}(e_n^2)$, with $e_n^2 = (y_n - \boldsymbol{\theta}_{n-1}^T\boldsymbol{x}_n)^2$. What do you observe regarding the convergence
speed of the algorithm for the two cases? Repeat the experiment with different values of the AR
coefficient $a_1$ and different values of the step size. Observe how the learning curve changes with
the different values of the step size and/or the value of the AR coefficient. Choose, also, a
relatively large value for the step size and make the LMS algorithm diverge. Comment on and justify
theoretically the obtained results concerning convergence speed and the error floor at the steady
state after convergence.

5.21 Use the data set generated from the AR(1) process of the previous exercise. Employ the
transform-domain LMS (Algorithm 5.6) with step size equal to 0.01. Also, set $\delta = 0.01$ and
$\beta = 0.5$. Moreover, employ the DCT transform. As in the previous exercise, run 100 independent
experiments and plot the average error per iteration. Compare the results with those of the
LMS with the same step size.

Hint: Compute the DCT transformation matrix using the dctmtx MATLAB® function. 

5.22 Generate the same experimental setup as in Exercise 5.20, with the difference that $\boldsymbol{\theta}_o \in \mathbb{R}^{60}$. For
the LMS algorithm set $\mu = 0.025$ and for the NLMS (Algorithm 5.3) $\mu = 0.35$ and $\delta = 0.001$.
Employ also the APA (Algorithm 5.2) with parameters $\mu = 0.1$, $\delta = 0.001$, and $q = 10, 30$. Plot
in the same figure the error learning curves of all these algorithms, as in the previous exercises.
How does the choice of $q$ affect the behavior of the APA, in terms of both convergence speed
and the error floor at which it settles after convergence? Play with different values of $q$ and of
the step size $\mu$.

5.23 Consider the decision feedback equalizer described in Section 5.10. 

(a) Generate a set of 1000 random ±1 values (BPSK) (i.e., $s_n$). Direct this sequence into a linear
channel with impulse response $\boldsymbol{h} = [0.04, -0.05, 0.07, -0.21, 0.72, 0.36, 0.21, 0.03, 0.07]^T$
and add to the output 11 dB white Gaussian noise. Denote the output as $u_n$.

(b) Design the adaptive DFE using $L = 21$, $l = 10$, and $\mu = 0.025$ following the training mode
only. Perform a set of 500 experiments feeding the DFE with different random sequences
from the ones described in step (a). Plot the MSE (averaged over the 500 experiments).
Observe that around $n = 250$ the algorithm achieves convergence.

(c) Design the adaptive decision feedback equalizer using the parameters of step (b). Feed the
equalizer with a series of 10,000 random values generated as in step (a). After the 250th
data sample, change the DFE to decision-directed mode. Count the percentage of the errors
performed by the equalizer from the 251st to the 10,000th sample.

(d) Repeat steps (a) to (c) changing the level of the white Gaussian noise added to the BPSK
values to 15, 12, 10 dB. Then, for each case, change the delay to $L = 5$. Comment on the
results.

5.24 Develop the MATLAB® code for the two forms of the DiLMS, ATC and CTA, and reproduce 
the results of Example 5.8. Play with the choice of the various parameters. Make sure that the 
resulting network is strongly connected.


6.1 INTRODUCTION

The squared error loss function was at the center of our attention in the previous two chapters. The
sum of squared errors cost was introduced in Chapter 3, followed by the mean-square error (MSE)
version, treated in Chapter 4. The stochastic gradient descent technique was employed in Chapter 5 to
help us bypass the need to perform expectations for obtaining the second-order statistics of the data, as
required by the MSE formulation.

In this chapter, we return to the original formulation of the sum of error squares, and our goal is to
look more closely at the resulting family of algorithms and their properties. Emphasis is given to
the geometric interpretation of the least-squares (LS) method as well as to some of the most important
statistical properties of the resulting solution. The singular value decomposition (SVD) of a matrix is
introduced for the first time in this book. Its geometric orthogonalizing properties are discussed and its
connection with what is known as dimensionality reduction is established; the latter topic is extensively
treated in Chapter 19. Also, a major part of the chapter is dedicated to the recursive LS (RLS) algorithm,
which is an online scheme that solves the LS optimization task. The spine of the RLS scheme comprises
an efficient update of the inverse (sample) covariance matrix of the input data, whose rationale can also
be adopted in the context of different learning methods for developing related online schemes; this
is one of the reasons we pay special tribute to the RLS algorithm. The other reason is its popularity
in a large number of signal processing/machine learning tasks, due to some attractive properties that
this scheme enjoys. Two major optimization schemes are introduced, namely, Newton’s method and
the coordinate descent method, and their use in solving the LS task is discussed. The bridge between
the RLS scheme and Newton’s optimization method is established. Finally, at the end of the chapter,
a more general formulation of the LS task, known as the total least-squares (TLS) method, is also
presented.


6.2 LEAST-SQUARES LINEAR REGRESSION: A GEOMETRIC PERSPECTIVE 

The focus of this section is to outline the geometric properties of the LS method. This provides an 
alternative view on the respective minimization method and helps in its understanding, by revealing 
a physical structure that is associated with the obtained solution. Geometry is very important when 
dealing with concepts related to the dimensionality reduction task. 

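We begin with our familiar linear regression model. Given a set of observations,

$$y_n = \boldsymbol{\theta}^T\boldsymbol{x}_n + \eta_n, \quad n = 1, 2, \ldots, N, \quad y_n \in \mathbb{R}, \ \boldsymbol{x}_n \in \mathbb{R}^l, \ \boldsymbol{\theta} \in \mathbb{R}^l,$$

where $\eta_n$ denotes the (unobserved) values of a zero mean noise source, the task is to obtain an estimate
of the unknown parameter vector, $\boldsymbol{\theta}$, so that

$$\hat{\boldsymbol{\theta}}_{LS} = \arg\min_{\boldsymbol{\theta}} \sum_{n=1}^{N} (y_n - \boldsymbol{\theta}^T\boldsymbol{x}_n)^2. \tag{6.1}$$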


Our stage of discussion is that of real numbers, and we will point out differences with the complex
number case whenever needed. Moreover, we assume that our data have been centered around their
sample means; alternatively, the intercept, $\theta_0$, can be absorbed in $\boldsymbol{\theta}$ with a corresponding increase in
the dimensionality of $\boldsymbol{x}_n$. Define

$$\boldsymbol{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^N, \qquad X := \begin{bmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_N^T \end{bmatrix} \in \mathbb{R}^{N \times l}.$$

Eq. (6.1) can be recast as

$$\hat{\boldsymbol{\theta}}_{LS} = \arg\min_{\boldsymbol{\theta}} \|\boldsymbol{e}\|^2, \tag{6.2}$$

where

$$\boldsymbol{e} := \boldsymbol{y} - X\boldsymbol{\theta}$$

and $\|\cdot\|$ denotes the Euclidean norm, which measures the “distance” between the respective vectors in
$\mathbb{R}^N$, i.e., $\boldsymbol{y}$ and $X\boldsymbol{\theta}$. Indeed, the $n$th component of the vector $\boldsymbol{e}$ is equal to $e_n = y_n - \boldsymbol{x}_n^T\boldsymbol{\theta}$, which, due
to the symmetry of the inner product, is equal to $y_n - \boldsymbol{\theta}^T\boldsymbol{x}_n$; furthermore, the square Euclidean norm
of the vector is the sum of the squares of its components, which makes the square norm of $\boldsymbol{e}$ identical
to the sum of squared errors cost in Eq. (6.1).

Let us now denote as $\boldsymbol{x}_1^c, \ldots, \boldsymbol{x}_l^c \in \mathbb{R}^N$ the columns of $X$, i.e.,

$$X = [\boldsymbol{x}_1^c, \ldots, \boldsymbol{x}_l^c].$$

Then the matrix-vector product above can be written as the linear combination of the columns of matrix
$X$, i.e.,

$$\hat{\boldsymbol{y}} := X\boldsymbol{\theta} = \sum_{i=1}^{l} \theta_i \boldsymbol{x}_i^c,$$

and

$$\boldsymbol{e} = \boldsymbol{y} - \hat{\boldsymbol{y}}.$$

Note that $\boldsymbol{e}$ can be viewed as the error vector between the vector of the output observations, $\boldsymbol{y}$, and $\hat{\boldsymbol{y}}$;
the latter is the prediction of $\boldsymbol{y}$ based on the input observations, stacked in $X$, and given a value for $\boldsymbol{\theta}$.
Obviously, the $N$-dimensional vector $\hat{\boldsymbol{y}}$, being a linear combination of the columns of $X$, lies in the
span$\{\boldsymbol{x}_1^c, \ldots, \boldsymbol{x}_l^c\}$. By definition, the latter is the subspace of $\mathbb{R}^N$ that is generated by all possible linear
combinations of the $l$ columns of $X$ (see Appendix A). Thus, naturally, our task now becomes that of
selecting $\boldsymbol{\theta}$ so that the error vector between $\boldsymbol{y}$ and $\hat{\boldsymbol{y}}$ has minimum norm. In general, the observations
vector, $\boldsymbol{y}$, does not lie in the subspace spanned by the columns of $X$, due to the existence of the noise.

According to the Pythagorean theorem of orthogonality for Euclidean spaces, the minimum norm
error is obtained if $\hat{\boldsymbol{y}}$ is chosen as the orthogonal projection of $\boldsymbol{y}$ onto the span$\{\boldsymbol{x}_1^c, \ldots, \boldsymbol{x}_l^c\}$. Recalling
the concept of orthogonal projections (Appendix A and Section 5.6, Eq. (5.65)), the orthogonal
projection of $\boldsymbol{y}$ onto the subspace spanned by the columns of $X$ is given by

$$\hat{\boldsymbol{y}} = X(X^TX)^{-1}X^T\boldsymbol{y}: \ \text{LS estimate}, \tag{6.3}$$

assuming that $X^TX$ is invertible. Recalling the definition of $\hat{\boldsymbol{y}}$, the above corresponds to the LS estimate
for the unknown set of parameters, as we know it from Chapter 3, i.e.,

$$\hat{\boldsymbol{\theta}} = (X^TX)^{-1}X^T\boldsymbol{y}.$$

The geometry is illustrated in Fig. 6.1.

FIGURE 6.1

In the figure, $\boldsymbol{y}$ lies outside the two-dimensional (shaded) plane that is defined by the two column vectors, $\boldsymbol{x}_1^c$ and
$\boldsymbol{x}_2^c$, i.e., span$\{\boldsymbol{x}_1^c, \boldsymbol{x}_2^c\}$. From all the points on this plane, the one that is closest to $\boldsymbol{y}$, in the minimum error norm
sense, is its respective orthogonal projection, i.e., the point $\hat{\boldsymbol{y}}$. The parameter vector $\boldsymbol{\theta}$ associated with this
orthogonal projection coincides with the LS estimate.

It is common to describe the LS solution in terms of the Moore-Penrose pseudoinverse of $X$, which
for a tall matrix$^1$ is defined as

$$X^{\dagger} := (X^TX)^{-1}X^T: \ \text{pseudoinverse of a tall matrix } X, \tag{6.4}$$

and hence we can write

$$\hat{\boldsymbol{\theta}}_{LS} = X^{\dagger}\boldsymbol{y}. \tag{6.5}$$

Thus, we have rederived Eq. (3.17) of Chapter 3, this time via geometric arguments. Note that the
pseudoinverse is a generalization of the notion of the inverse of a square matrix. Indeed, if $X$ is square
and invertible, then it is readily seen that the pseudoinverse coincides with $X^{-1}$. For complex-valued
data, the only difference is that transposition is replaced by the Hermitian one.
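
The geometric statements above can be verified numerically. The sketch below is illustrative only; the dimensions, the noise level, and the parameter values are assumptions. It computes the LS estimate via the normal equations, via the pseudoinverse in (6.5), and via a library least-squares solver, and it checks that the error vector is orthogonal to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, l = 50, 3
X = rng.standard_normal((N, l))                  # tall matrix, full column rank (almost surely)
theta_true = np.array([1.0, -2.0, 0.5])          # illustrative value
y = X @ theta_true + 0.1 * rng.standard_normal(N)

theta_ne = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations
theta_pinv = np.linalg.pinv(X) @ y               # (6.5)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ theta_ne                             # orthogonal projection, (6.3)
e = y - y_hat                                    # error vector

# Any other parameter vector gives an error of larger (or equal) norm.
theta_other = theta_ne + 0.05 * rng.standard_normal(l)
e_other = y - X @ theta_other
print(np.linalg.norm(e), np.linalg.norm(e_other))
```

In practice, solving the normal equations explicitly can be numerically fragile when $X^TX$ is ill conditioned; solvers based on orthogonal factorizations, such as the one behind `np.linalg.lstsq`, are usually preferred.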


1 A matrix, e.g., $X \in \mathbb{R}^{N \times l}$, is called tall if $N > l$. If $N < l$ it is known as fat. If $l = N$, it is a square matrix.


6.3 STATISTICAL PROPERTIES OF THE LS ESTIMATOR

Some of the statistical properties of the LS estimator were touched on in Chapter 3, for the special case
of a random real parameter. Here, we will look at this issue in a more general setting. Assume that there
exists a true (yet unknown) parameter/weight vector $\boldsymbol{\theta}_o$ that generates the output (dependent) random
variables (stacked in a random vector $\mathbf{y} \in \mathbb{R}^N$), according to the model

$$\mathbf{y} = X\boldsymbol{\theta}_o + \boldsymbol{\eta},$$

where $\boldsymbol{\eta}$ is a zero mean noise vector. Observe that we have assumed that $X$ is fixed and not random; that
is, the randomness underlying the output variables $\mathbf{y}$ is due solely to the noise. Under the previously
stated assumptions, the following properties hold.

THE LS ESTIMATOR IS UNBIASED 

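The LS estimator for the parameters is given by

$$\hat{\boldsymbol{\theta}}_{LS} = (X^TX)^{-1}X^T\mathbf{y} = (X^TX)^{-1}X^T(X\boldsymbol{\theta}_o + \boldsymbol{\eta}) = \boldsymbol{\theta}_o + (X^TX)^{-1}X^T\boldsymbol{\eta}, \tag{6.6}$$

or

$$\mathbb{E}[\hat{\boldsymbol{\theta}}_{LS}] = \boldsymbol{\theta}_o + (X^TX)^{-1}X^T\,\mathbb{E}[\boldsymbol{\eta}] = \boldsymbol{\theta}_o,$$

which proves the claim.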

COVARIANCE MATRIX OF THE LS ESTIMATOR 

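Let, in addition to the previously adopted assumptions,

$$\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^T] = \sigma_\eta^2 I,$$

that is, the source generating the noise samples is white. By the definition of the covariance matrix, we
get

$$\Sigma_{\hat{\theta}_{LS}} = \mathbb{E}\left[(\hat{\boldsymbol{\theta}}_{LS} - \boldsymbol{\theta}_o)(\hat{\boldsymbol{\theta}}_{LS} - \boldsymbol{\theta}_o)^T\right],$$

and substituting $\hat{\boldsymbol{\theta}}_{LS} - \boldsymbol{\theta}_o$ from (6.6), we obtain

$$\begin{aligned}
\Sigma_{\hat{\theta}_{LS}} &= \mathbb{E}\left[(X^TX)^{-1}X^T\boldsymbol{\eta}\boldsymbol{\eta}^TX(X^TX)^{-1}\right] \\
&= (X^TX)^{-1}X^T\,\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^T]\,X(X^TX)^{-1} \\
&= \sigma_\eta^2 (X^TX)^{-1}.
\end{aligned} \tag{6.7}$$

Note that, for large values of $N$, we can write

$$X^TX = \sum_{n=1}^{N} \boldsymbol{x}_n\boldsymbol{x}_n^T \approx N\Sigma_x,$$

where $\Sigma_x$ is the covariance matrix of our (zero mean) input variables, i.e.,

$$\Sigma_x := \mathbb{E}[\mathbf{x}_n\mathbf{x}_n^T] \approx \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{x}_n\boldsymbol{x}_n^T.$$

Thus, for large values of $N$, we can write

$$\Sigma_{\hat{\theta}_{LS}} \approx \frac{\sigma_\eta^2}{N}\Sigma_x^{-1}. \tag{6.8}$$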


In other words, under the adopted assumptions, the LS estimator is not only unbiased, but its covariance
matrix tends asymptotically to zero. That is, with high probability, the estimate $\hat{\boldsymbol{\theta}}_{LS}$, which is obtained
via a large number of measurements, will be close to the true value $\boldsymbol{\theta}_o$. Viewing it slightly differently,
note that the LS solution tends to the MSE solution, which was discussed in Chapter 4. Indeed, for the
case of centered data,

$$\lim_{N \to \infty} \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{x}_n\boldsymbol{x}_n^T = \Sigma_x$$

and

$$\lim_{N \to \infty} \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{x}_n y_n = \mathbb{E}[\mathbf{x}\mathrm{y}] = \boldsymbol{p}.$$

Moreover, we know that for the linear regression modeling case, the normal equations, $\Sigma_x\boldsymbol{\theta} = \boldsymbol{p}$, result
in the solution $\boldsymbol{\theta} = \boldsymbol{\theta}_o$ (Remarks 4.2).
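
A small Monte Carlo experiment illustrates the unbiasedness property and the covariance formula (6.7). The code is a sketch under stated assumptions: the input matrix is drawn once and then kept fixed, the noise is white Gaussian, and the dimensions, the noise level, and the number of trials are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
N, l, sigma = 40, 2, 0.5
X = rng.standard_normal((N, l))                  # drawn once, then treated as fixed
theta_o = np.array([1.0, -1.0])                  # illustrative true parameter vector
P = np.linalg.solve(X.T @ X, X.T)                # (X^T X)^{-1} X^T

trials = 20000
noise = sigma * rng.standard_normal((trials, N)) # white noise, one row per realization
Y = X @ theta_o + noise                          # each row is one realization of y
estimates = Y @ P.T                              # each row is one LS estimate

mean_est = estimates.mean(axis=0)                # should be close to theta_o
emp_cov = np.cov(estimates, rowvar=False)        # empirical covariance of the estimates
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)   # (6.7)
print(mean_est)
print(emp_cov)
print(theory_cov)
```

Increasing $N$ shrinks both covariance matrices roughly as $1/N$, in line with (6.8).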


THE LS ESTIMATOR IS BLUE IN THE PRESENCE OF WHITE NOISE 

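The notion of the best linear unbiased estimator (BLUE) was introduced in Section 4.9.1 in the context
of the Gauss-Markov theorem. Let $\hat{\boldsymbol{\theta}}$ denote any other linear unbiased estimator, under the
assumption that

$$\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^T] = \sigma_\eta^2 I.$$

Then, due to the linearity assumption, the estimator will have a linear dependence on the output random
variables that are observed, i.e.,

$$\hat{\boldsymbol{\theta}} = H\mathbf{y}, \quad H \in \mathbb{R}^{l \times N}.$$

It will be shown that the variance of such an estimator can never become smaller than that of the LS
one, i.e.,

$$\mathbb{E}\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)^T(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)\right] \ge \mathbb{E}\left[(\hat{\boldsymbol{\theta}}_{LS} - \boldsymbol{\theta}_o)^T(\hat{\boldsymbol{\theta}}_{LS} - \boldsymbol{\theta}_o)\right]. \tag{6.9}$$

Indeed, from the respective definitions we have

$$\hat{\boldsymbol{\theta}} = H(X\boldsymbol{\theta}_o + \boldsymbol{\eta}) = HX\boldsymbol{\theta}_o + H\boldsymbol{\eta}. \tag{6.10}$$

However, because $\hat{\boldsymbol{\theta}}$ has been assumed unbiased, (6.10) implies that $HX = I$ and

$$\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o = H\boldsymbol{\eta}.$$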


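Thus,

$$\Sigma_{\hat{\theta}} := \mathbb{E}\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)^T\right] = \sigma_\eta^2 HH^T.$$

However, taking into account that $HX = I$, it is easily checked (try it) that

$$\sigma_\eta^2 HH^T = \sigma_\eta^2 (H - X^{\dagger})(H - X^{\dagger})^T + \sigma_\eta^2 (X^TX)^{-1},$$

where $X^{\dagger}$ is the respective pseudoinverse matrix, defined in (6.4).

Because $\sigma_\eta^2 (H - X^{\dagger})(H - X^{\dagger})^T$ is a positive semidefinite matrix (Appendix A), its trace is
nonnegative (Problem 6.1) and thus we have

$$\operatorname{trace}\{\sigma_\eta^2 HH^T\} \ge \operatorname{trace}\{\sigma_\eta^2 (X^TX)^{-1}\},$$

and recalling (6.7), we have proved that

$$\operatorname{trace}\{\Sigma_{\hat{\theta}}\} \ge \operatorname{trace}\{\Sigma_{\hat{\theta}_{LS}}\}. \tag{6.11}$$

However, recalling from linear algebra the property of the trace (Appendix A), we have

$$\operatorname{trace}\{\Sigma_{\hat{\theta}}\} = \operatorname{trace}\left\{\mathbb{E}\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)^T\right]\right\} = \mathbb{E}\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)^T(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o)\right],$$

and similarly for $\Sigma_{\hat{\theta}_{LS}}$. Hence, Eq. (6.11) above leads directly to (6.9). Moreover, equality holds only
if

$$H = X^{\dagger} = (X^TX)^{-1}X^T.$$

Note that this result could have been obtained directly from (4.102) by setting $\Sigma_\eta = \sigma_\eta^2 I$. This also
emphasizes the fact that if the noise is not white, then the LS parameter estimator is no longer BLUE.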
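
The decomposition of $HH^T$ used in the proof can be checked numerically. The sketch below is illustrative; the dimensions are arbitrary, and the way the competing estimator $H$ is built (as $X^{\dagger} + Z$ with $ZX = O$) is a construction assumed here so that $HX = I$ holds by design.

```python
import numpy as np

rng = np.random.default_rng(5)
N, l = 12, 3
X = rng.standard_normal((N, l))
X_pinv = np.linalg.solve(X.T @ X, X.T)           # pseudoinverse of a tall matrix, (6.4)

# Any H with HX = I can be written as X_pinv + Z with ZX = 0.
proj_perp = np.eye(N) - X @ X_pinv               # projector onto the orthogonal complement
Z = rng.standard_normal((l, N)) @ proj_perp      # rows of Z are orthogonal to the columns of X
H = X_pinv + Z                                   # a linear unbiased competitor

lhs = H @ H.T
rhs = (H - X_pinv) @ (H - X_pinv).T + np.linalg.inv(X.T @ X)
trace_gap = np.trace(lhs) - np.trace(np.linalg.inv(X.T @ X))
print("trace gap (nonnegative):", trace_gap)
```

The trace gap equals $\operatorname{trace}\{ZZ^T\}$, which vanishes only for $Z = O$, that is, only when $H$ coincides with the pseudoinverse.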


THE LS ESTIMATOR ACHIEVES THE CRAMER-RAO BOUND FOR WHITE GAUSSIAN 
NOISE 

The concept of the Cramer-Rao lower bound was introduced in Chapter 3. There, it was shown that,
under the white Gaussian noise assumption, the LS estimator of a real number was efficient; that is,
it achieves the Cramer-Rao bound. Moreover, in Problem 3.9, it was shown that if $\boldsymbol{\eta}$ is zero mean
Gaussian noise with covariance matrix $\Sigma_\eta$, then the efficient estimator is given by

$$\hat{\boldsymbol{\theta}} = (X^T\Sigma_\eta^{-1}X)^{-1}X^T\Sigma_\eta^{-1}\mathbf{y},$$

which for $\Sigma_\eta = \sigma_\eta^2 I$ coincides with the LS estimator. In other words, under the white Gaussian noise
assumption, the LS estimator becomes a minimum variance unbiased estimator (MVUE). This is a
strong result. No other unbiased estimator (not necessarily linear) will do better than the LS one. Note
that this result holds true not only asymptotically, but also for a finite number of samples $N$. If one
wishes to decrease further the mean-square error, then biased estimators, as produced via
regularization, have to be considered; this has already been discussed in Chapter 3; see also [16,50] and the
references therein.


ASYMPTOTIC DISTRIBUTION OF THE LS ESTIMATOR 


We have already seen that the LS estimator is unbiased and that its covariance matrix is approximately
(for large values of $N$) given by (6.8). Thus, as $N \to \infty$, the variance around the true value, $\boldsymbol{\theta}_o$, is
becoming increasingly small. Furthermore, there is a stronger result, which provides the distribution
of the LS estimator for large values of $N$. Under some general assumptions, such as independence
of successive observation vectors and that the white noise source is independent of the input, and
mobilizing the central limit theorem, it can be shown (Problem 6.2) that

$$\sqrt{N}(\hat{\boldsymbol{\theta}}_{LS} - \boldsymbol{\theta}_o) \longrightarrow \mathcal{N}(\boldsymbol{0}, \sigma_\eta^2\Sigma_x^{-1}), \tag{6.12}$$

where the limit is meant to be in distribution (see Section 2.6). Alternatively, we can write

$$\hat{\boldsymbol{\theta}}_{LS} \sim \mathcal{N}\left(\boldsymbol{\theta}_o, \frac{\sigma_\eta^2}{N}\Sigma_x^{-1}\right).$$

In other words, the LS parameter estimator is asymptotically distributed according to the normal
distribution.


6.4 ORTHOGONALIZING THE COLUMN SPACE OF THE INPUT MATRIX: THE 
SVD METHOD 


The singular value decomposition (SVD) of a matrix is among the most powerful tools in linear al- 
gebra. As a matter of fact, it will be the tool that we are going to use as a starting point to deal with 
dimensionality reduction in Chapter 19. Due to its importance in machine learning, we present the 
basic theory here and exploit it to shed light on our LS estimation task from a different angle. We start 
by considering the general case, and then we tailor the theory to our specific needs. 

Let $X$ be an $m \times l$ matrix and allow its rank $r$ not to be necessarily full (Appendix A), i.e.,

$$r \le \min\{m, l\}.$$

Then there exist orthogonal matrices,$^2$ $U$ and $V$, of dimensions $m \times m$ and $l \times l$, respectively, so that

$$X = U\begin{bmatrix} D & O \\ O & O \end{bmatrix}V^T: \ \text{singular value decomposition of } X, \tag{6.13}$$

where $D$ is an $r \times r$ diagonal matrix with elements $\sigma_i = \sqrt{\lambda_i}$, known as the singular values of $X$,
where $\lambda_i$, $i = 1, 2, \ldots, r$, are the nonzero eigenvalues of $XX^T$; matrices denoted as $O$ comprise zero
elements and are of appropriate dimensions.

2 Recall that a square matrix $U$ is called orthogonal if $U^TU = UU^T = I$. For complex-valued square matrices, if $U^HU =
UU^H = I$, $U$ is called unitary.

Taking into account the zero elements in the diagonal matrix, (6.13) can be rewritten as

$$X = U_r D V_r^T = \sum_{i=1}^{r} \sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^T, \tag{6.14}$$

where

$$U_r := [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_r] \in \mathbb{R}^{m \times r}, \qquad V_r := [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_r] \in \mathbb{R}^{l \times r}. \tag{6.15}$$

Eq. (6.14) provides a matrix factorization of $X$ in terms of $U_r$, $V_r$, and $D$. We will make use of this
factorization in Chapter 19, when dealing with dimensionality reduction techniques. Fig. 6.2 offers a
schematic illustration of (6.14).


FIGURE 6.2

The $m \times l$ matrix $X$, of rank $r \le \min\{m, l\}$, factorizes in terms of the matrices $U_r \in \mathbb{R}^{m \times r}$, $V_r \in \mathbb{R}^{l \times r}$ and the $r \times r$
diagonal matrix $D$.

It turns out that «,■ e R" 5 , i = 1, 2.r, known as left singular vectors, are the normalized eigen- 

vectors corresponding to the nonzero eigenvalues of XX 7 , and n, e RE i = 1.2,..., r, are the nor¬ 
malized eigenvectors associated with the nonzero eigenvalues of X T X, and they are known as right 
singular vectors. Note that both XX T and X 1 X share the same eigenvalues (Problem 6.3). 

Proof. By the respective definitions (Appendix A), we have 

XX T Ui=XjUj, i = l,2,...,r, (6.16) 


3 Usually it is denoted as E, but here we avoid the notation so as not to confuse it with the covariance matrix E ; D reminds us 
of its diagonal structure. 
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and 


X T Xvi=XiVi, / = 1 , 2 , 


(6.17) 


Moreover, because $XX^T$ and $X^TX$ are symmetric matrices, it is known from linear algebra that their eigenvalues are real$^4$ and the respective eigenvectors are orthogonal, which can then be normalized to unit norm to become orthonormal (Problem 6.4). It is a matter of simple algebra (Problem 6.5) to show from (6.16) and (6.17) that

$$u_i = \frac{1}{\sigma_i} X v_i, \quad i = 1, 2, \ldots, r. \tag{6.18}$$

Thus, we can write

$$\sum_{i=1}^{r} \sigma_i u_i v_i^T = X \sum_{i=1}^{r} v_i v_i^T = X \sum_{i=1}^{l} v_i v_i^T = X V V^T,$$

where we used the fact that for eigenvectors corresponding to $\sigma_i = 0$ ($\lambda_i = 0$), $i = r + 1, \ldots, l$, $X v_i = 0$. However, due to the orthonormality of $v_i$, $i = 1, 2, \ldots, l$, $VV^T = I$ and the claim in (6.14) has been proved. □
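The relations used in the proof can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix sizes, the random seed, and all variable names are illustrative choices and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, l, r = 6, 4, 2
# Build an m x l matrix of rank r as a product of two thin random factors.
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, l))

# Full SVD: U is m x m, Vt is l x l, s holds min(m, l) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s[0]))          # numerical rank
U_r, D, V_r = U[:, :rank], np.diag(s[:rank]), Vt[:rank, :].T

# (6.14): X = U_r D V_r^T
X_rebuilt = U_r @ D @ V_r.T

# sigma_i^2 are the nonzero eigenvalues of X^T X
# (eigvalsh returns eigenvalues in ascending order, so reverse them).
eig_XtX = np.linalg.eigvalsh(X.T @ X)[::-1][:rank]

# (6.18): u_i = X v_i / sigma_i, computed for all i at once.
U_from_V = (X @ V_r) / s[:rank]
```

Comparing `X_rebuilt` with `X`, `eig_XtX` with `s[:rank]**2`, and `U_from_V` with `U_r` (for instance via `np.allclose`) reproduces (6.14), the definition of the singular values, and (6.18), respectively.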


PSEUDOINVERSE MATRIX AND SVD 

Let us now elaborate on the SVD expansion and investigate its geometric implications. By the definition of the pseudoinverse, $X^\dagger$, and assuming the $N \times l$ ($N > l$) data matrix to be full column rank ($r = l$), employing (6.14) in (6.5) we get (Problem 6.6)

$$\hat{y} = X \hat{\theta}_{LS} = X (X^TX)^{-1} X^T y = U_l U_l^T y = [u_1, \ldots, u_l] \begin{bmatrix} u_1^T y \\ \vdots \\ u_l^T y \end{bmatrix},$$

or

$$\hat{y} = \sum_{i=1}^{l} (u_i^T y)\, u_i : \quad \text{LS estimate in terms of an orthonormal basis.} \tag{6.19}$$

The latter represents the projection of $y$ onto the column space of $X$, i.e., span$\{x_1^c, \ldots, x_l^c\}$, using a corresponding orthonormal basis, $\{u_1, \ldots, u_l\}$, to describe the subspace (see Fig. 6.3). Note that each $u_i$, $i = 1, 2, \ldots, l$, lies in the space spanned by the columns of $X$ as suggested from Eq. (6.18). In other words, the SVD of matrix $X$ provides the orthonormal basis that describes the respective column space.
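A short numerical illustration of (6.19) may help here. The sketch below assumes NumPy and uses made-up data; it compares the LS prediction obtained from the normal equations with the projection written in the orthonormal basis of left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N, l = 8, 3
X = rng.standard_normal((N, l))      # tall matrix, full column rank almost surely
y = rng.standard_normal(N)

# LS estimate through the normal equations, theta = (X^T X)^{-1} X^T y.
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
y_hat_normal = X @ theta_ls

# Thin SVD: the columns of U_l form an orthonormal basis of the column space.
U_l, s, Vt = np.linalg.svd(X, full_matrices=False)
weights = U_l.T @ y                  # the inner products u_i^T y, i = 1, ..., l
y_hat_svd = U_l @ weights            # (6.19)

# The residual of a projection is orthogonal to every column of X.
residual = y - y_hat_svd
```

Both routes give the same prediction vector, and `X.T @ residual` vanishes up to rounding, which is the orthogonality condition that characterizes the projection.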


4 This is also true for complex matrices, $XX^H$, $X^HX$.

FIGURE 6.3

The eigenvectors $u_1$, $u_2$ lie in the column space of $X$, i.e., in the (shaded) plane span$\{x_1^c, x_2^c\}$, and form an orthonormal basis. Because $\hat{y}$ is the projection of $y$ onto this subspace, it can be expressed as a linear combination of the two vectors of the orthonormal basis. The respective weights for the linear combination are $u_1^T y$ and $u_2^T y$.


We can further use the previous results to express the pseudoinverse in terms of the eigenvalues/eigenvectors of $X^TX$ ($XX^T$). It is easily shown that we can write

$$X^\dagger = (X^TX)^{-1} X^T = V_l D^{-1} U_l^T = \sum_{i=1}^{l} \frac{1}{\sigma_i} v_i u_i^T.$$

As a matter of fact, this is in line with the more general definition of a pseudoinverse in linear algebra, including matrices that are not full rank (i.e., $X^TX$ is not invertible), namely,

$$X^\dagger := V_r D^{-1} U_r^T = \sum_{i=1}^{r} \frac{1}{\sigma_i} v_i u_i^T : \quad \text{pseudoinverse of a matrix of rank } r. \tag{6.20}$$

In the case of matrices with $N < l$, and assuming that the rank of $X$ is equal to $N$, it is readily verified that the previous generalized definition of the pseudoinverse is equivalent to

$$X^\dagger = X^T (XX^T)^{-1} : \quad \text{pseudoinverse of a fat matrix } X. \tag{6.21}$$

Note that a system with $N$ equations and $l > N$ unknowns,

$$X\theta = y,$$

has infinitely many solutions. Such systems are known as underdetermined, to be contrasted with the overdetermined systems for which $N > l$. It can be shown that for underdetermined systems, the solution $\theta = X^\dagger y$ is the one with the minimum Euclidean norm. We will consider the case of such systems of equations in more detail in Chapter 9, in the context of sparse models.

Remarks 6.1.

Here we summarize some important properties from linear algebra that are related to the SVD of a matrix. We will make use of these properties in various parts of the book later on.

• Computing the pseudoinverse using the SVD is numerically more robust than the direct method via the inversion of $X^TX$.

• Rank-$k$ matrix approximation: The best rank $k < r \le \min(m, l)$ approximation matrix, $\hat{X} \in \mathbb{R}^{m \times l}$, of $X \in \mathbb{R}^{m \times l}$ in the Frobenius, $\|\cdot\|_F$, as well as in the spectral, $\|\cdot\|_2$, norms sense is given by (e.g., [26])

$$\hat{X} = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \tag{6.22}$$

with the previously stated norms defined as (Problem 6.9)

$$\|X\|_F := \sqrt{\sum_i \sum_j |X(i,j)|^2} = \sqrt{\sum_{i=1}^{r} \sigma_i^2} : \quad \text{Frobenius norm of } X, \tag{6.23}$$

and

$$\|X\|_2 := \sigma_1 : \quad \text{spectral norm of } X, \tag{6.24}$$

where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$ are the singular values of $X$. In other words, $\hat{X}$ in (6.22) minimizes the error matrix norms,

$$\|X - \hat{X}\|_F \quad \text{and} \quad \|X - \hat{X}\|_2.$$

Moreover, it turns out that the approximation error is given by (Problems 6.10 and 6.11)

$$\|X - \hat{X}\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}, \qquad \|X - \hat{X}\|_2 = \sigma_{k+1}.$$

This is also known as the Eckart-Young-Mirsky theorem.

• Null and range spaces of $X$: Let the rank of an $m \times l$ matrix $X$ be equal to $r \le \min\{m, l\}$. Then the following easily shown properties hold (Problem 6.13). The null space of $X$, $\mathcal{N}(X)$, defined as

$$\mathcal{N}(X) := \{x \in \mathbb{R}^l : Xx = 0\}, \tag{6.25}$$

is also expressed as

$$\mathcal{N}(X) = \text{span}\{v_{r+1}, \ldots, v_l\}. \tag{6.26}$$

Furthermore, the range space of $X$, $\mathcal{R}(X)$, defined as

$$\mathcal{R}(X) := \{x \in \mathbb{R}^m : \exists\, a \text{ such that } Xa = x\}, \tag{6.27}$$

is expressed as

$$\mathcal{R}(X) = \text{span}\{u_1, \ldots, u_r\}. \tag{6.28}$$

• Everything that has been said before transfers to complex-valued data, trivially, by replacing transposition with the Hermitian one.
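The properties listed in these remarks lend themselves to a quick numerical check. The sketch below assumes NumPy; the helper name `pinv_svd`, the tolerance, and the test matrices are hypothetical choices made only for illustration. It builds the pseudoinverse from (6.20), verifies the minimum-norm property for a fat matrix, and evaluates the rank-$k$ approximation errors.

```python
import numpy as np

def pinv_svd(X, tol=1e-10):
    """Pseudoinverse via (6.20): sum over the r nonzero singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))           # numerical rank
    return (Vt[:r].T / s[:r]) @ U[:, :r].T    # V_r D^{-1} U_r^T

rng = np.random.default_rng(2)

# Fat matrix (N < l): X^+ y is the minimum-norm solution of X theta = y.
X_fat = rng.standard_normal((3, 6))
y = rng.standard_normal(3)
theta_min = pinv_svd(X_fat) @ y

# Any other solution differs by a vector of the null space span{v_{r+1}, ..., v_l}.
U, s, Vt = np.linalg.svd(X_fat, full_matrices=True)
null_vec = Vt[3:].T @ rng.standard_normal(3)
theta_other = theta_min + null_vec

# Rank-k approximation (6.22) and its errors in both norms.
X = rng.standard_normal((7, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
err_fro = np.linalg.norm(X - X_k, "fro")      # compare with sqrt(sum_{i>k} sigma_i^2)
err_spec = np.linalg.norm(X - X_k, 2)         # compare with sigma_{k+1}
```

In this sketch `pinv_svd(X_fat)` can be compared with `np.linalg.pinv(X_fat)` and with the closed form (6.21). Thresholding the singular values, rather than inverting $X^TX$, is what makes the SVD route tolerant of rank deficiency.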


6.5 RIDGE REGRESSION: A GEOMETRIC POINT OF VIEW 

In this section, we shed light on the ridge regression task from a different perspective. Instead of a dry 
optimization task, we are going to look at it by mobilizing statistical geometric arguments that the SVD 
decomposition offers to us. 

Ridge regression was introduced in Chapter 3 as a means to impose bias on the LS solution and also as a major path to cope with overfitting and ill-conditioning problems. In ridge regression, the minimizer results as

$$\hat{\theta}_R = \arg\min_{\theta} \left\{ \|y - X\theta\|^2 + \lambda \|\theta\|^2 \right\},$$

where $\lambda > 0$ is a user-defined parameter that controls the importance of the regularizing term. Taking the gradient with respect to $\theta$ and equating to zero results in

$$\hat{\theta}_R = (X^TX + \lambda I)^{-1} X^T y. \tag{6.29}$$

Looking at (6.29), we readily observe (a) its “stabilizing” effect from the numerical point of view, when $X^TX$ is ill-conditioned and its inversion poses problems, and (b) its biasing effect on the (unbiased) LS solution. Note that ridge regression provides a solution even if $X^TX$ is not invertible, as is the case when $N < l$. Let us now employ the SVD expansion of (6.14) in (6.29). Assuming a full column rank matrix $X$, we obtain (Problem 6.14)

$$\hat{y} = X\hat{\theta}_R = U_l D (D^2 + \lambda I)^{-1} D U_l^T y,$$

or

$$\hat{y} = \sum_{i=1}^{l} \frac{\sigma_i^2}{\lambda + \sigma_i^2} (u_i^T y)\, u_i : \quad \text{ridge regression shrinks the weights.} \tag{6.30}$$


Comparing (6.30) and (6.19), we observe that the components of the projection of $y$ onto span$\{u_1, \ldots, u_l\}$ (span$\{x_1^c, \ldots, x_l^c\}$) are shrunk with respect to their LS counterpart. Moreover, the shrinking level depends on the singular values $\sigma_i$; the smaller the value of $\sigma_i$, the higher the shrinking of the corresponding component. Let us now turn our attention to the geometric interpretation of this algebraic finding. This small diversion will also provide more insight into the interpretation of $v_i$ and $u_i$, $i = 1, 2, \ldots, l$, which appear in the SVD method.

FIGURE 6.4

The singular vector $v_1$, which is associated with the singular value $\sigma_1 > \sigma_2$, points to the direction where most of the (variance) activity in the data space takes place. The variance in the direction of $v_2$ is smaller.

Recall that $X^TX$ is a scaled version of the sample covariance matrix for centered regressors. Also, by the definition of the $v_i$’s, we have

$$(X^TX) v_i = \sigma_i^2 v_i, \quad i = 1, 2, \ldots, l,$$

and in a compact form,

$$(X^TX) V_l = V_l\, \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_l^2\} \;\Rightarrow\; (X^TX) = V_l D^2 V_l^T = \sum_{i=1}^{l} \sigma_i^2 v_i v_i^T, \tag{6.31}$$

where the orthogonality property of $V_l$ has been used for the inversion. Note that in (6.31), the (scaled) sample covariance matrix is written as a sum of rank one matrices, $v_i v_i^T$, each one weighted by the square of the respective singular value, $\sigma_i^2$. We are now close to revealing the physical/geometric meaning of the singular values. To this end, define


$$q_j := X v_j = \begin{bmatrix} x_1^T v_j \\ \vdots \\ x_N^T v_j \end{bmatrix}, \quad j = 1, 2, \ldots, l. \tag{6.32}$$

Note that $q_j$ is a vector in the column space of $X$. Moreover, the respective squared norm of $q_j$ is given by

$$\sum_{n=1}^{N} q_j^2(n) = q_j^T q_j = v_j^T X^TX v_j = v_j^T \left( \sum_{i=1}^{l} \sigma_i^2 v_i v_i^T \right) v_j = \sigma_j^2,$$

due to the orthonormality of the $v_j$’s. That is, $\sigma_j^2$ is equal to the (scaled) sample variance of the elements of $q_j$. However, by the definition in (6.32), this is the sample variance of the projections of the input vectors (regressors), $x_n$, $n = 1, 2, \ldots, N$, along the direction $v_j$. The larger the value of $\sigma_j$, the larger the spread of the (input) data along the respective direction. This is shown in Fig. 6.4, where $\sigma_1 \gg \sigma_2$. From the variance point of view, $v_1$ is the more informative direction, compared to $v_2$. It is the direction where most of the activity takes place. This observation is at the heart of dimensionality reduction, which will be treated in more detail in Chapter 19. Moreover, from (6.18), we obtain

$$q_j = X v_j = \sigma_j u_j. \tag{6.33}$$

In other words, $u_j$ points in the direction of $q_j$. Thus, (6.30) suggests that while projecting $y$ onto the column space of $X$, the directions $u_j$ associated with larger values of variance are weighted more heavily than the rest. Ridge regression respects and assigns higher weights to the more informative directions, where most of the data activity takes place. Alternatively, the less important directions, those associated with small data variance, are shrunk the most.

One final comment concerning ridge regression is that the ridge solutions are not invariant under scaling of the input variables. This becomes obvious by looking at the respective equations. Thus, in practice, the input variables are often standardized to unit variance.
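The shrinkage interpretation of ridge regression can be made concrete with a small experiment. The following sketch assumes NumPy; the data, the column scales, and the value of `lam` are arbitrary illustrative choices, and the regressors are not centered, so "variance" is to be read in the scaled sense used above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, l, lam = 50, 3, 5.0
# Regressors with very different spreads along the coordinate axes.
X = rng.standard_normal((N, l)) * np.array([10.0, 1.0, 0.1])
y = rng.standard_normal(N)

# Closed form (6.29).
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(l), X.T @ y)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (lam + s**2)             # shrinking factors appearing in (6.30)
y_hat_ridge = U @ (shrink * (U.T @ y))   # (6.30)
y_hat_ls = U @ (U.T @ y)                 # (6.19), no shrinkage

# Squared norms of q_j = X v_j; by the derivation above they equal sigma_j^2.
q_norms_sq = np.sum((X @ Vt.T) ** 2, axis=0)
```

Because NumPy returns the singular values in descending order, `shrink` is non-increasing: the direction with the largest spread is left almost untouched, while the direction with the smallest singular value is suppressed the most.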

PRINCIPAL COMPONENTS REGRESSION

We have just seen that the effect of the ridge regression is to enforce a shrinking rule on the parameters, which decreases the contribution of the less important of the components $u_i$ in the respective summation. This can be considered as a soft shrinkage rule. An alternative path is to adopt a hard thresholding rule and keep only the $m$ most significant directions, known as the principal axes or directions, and forget the rest by setting the respective weights equal to zero. Equivalently, we can write

$$\hat{y} = \sum_{i=1}^{m} \hat{\theta}_i u_i, \tag{6.34}$$

where

$$\hat{\theta}_i = u_i^T y, \quad i = 1, 2, \ldots, m. \tag{6.35}$$

Furthermore, employing (6.18) we have

$$\hat{y} = \sum_{i=1}^{m} \frac{\hat{\theta}_i}{\sigma_i} X v_i, \tag{6.36}$$

or equivalently, the weights for the expansion of the solution in terms of the input data can be expressed as

$$\theta = \sum_{i=1}^{m} \frac{\hat{\theta}_i}{\sigma_i} v_i. \tag{6.37}$$

In other words, the prediction $\hat{y}$ is performed in a subspace of the column space of $X$, which is spanned by the $m$ principal axes, that is, the subspace where most of the data activity takes place.
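Equations (6.34)-(6.37) translate almost line by line into code. The sketch below assumes NumPy; the function name `pcr_fit` and the synthetic data are hypothetical and serve only to illustrate the hard thresholding rule.

```python
import numpy as np

def pcr_fit(X, y, m):
    """Keep the m leading singular directions; return (theta, y_hat)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    theta_hat = U[:, :m].T @ y               # (6.35): weights u_i^T y
    y_hat = U[:, :m] @ theta_hat             # (6.34): prediction in the subspace
    theta = Vt[:m].T @ (theta_hat / s[:m])   # (6.37): weights on the input data
    return theta, y_hat

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
y = rng.standard_normal(40)

theta_pcr, y_hat_pcr = pcr_fit(X, y, m=2)    # keep two principal directions
theta_full, y_hat_full = pcr_fit(X, y, m=4)  # keep all: reduces to plain LS
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

With `m` equal to the number of columns, nothing is discarded and the result coincides with the LS solution; smaller values of `m` zero out the weights of the low-variance directions instead of merely shrinking them.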
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6.6 THE RECURSIVE LEAST-SQUARES ALGORITHM 

In previous chapters, we discussed the need for developing recursive algorithms that update the estimates every time a new pair of input-output observations is received. Solving the LS problem using a general purpose solver would amount to $O(l^3)$ multiplications and additions (MADS), due to the involved matrix inversion. Also, $O(Nl^2)$ operations are required to compute the (scaled) sample covariance matrix $X^TX$. In this section, the special structure of $X^TX$ will be taken into account in order to obtain a computationally efficient online scheme for the solution of the LS task. Moreover, when dealing with time-recursive techniques, one can also care for time variations of the statistical properties of the involved data. We will allow for such applications, and the sum of squared errors cost will be slightly modified in order to accommodate time-varying environments.

For the needs of the section, our notation will be slightly “enriched” and we will use explicitly the time index, $n$. Also, to be consistent with the online schemes discussed in Chapter 5, we will assume that the time starts at $n = 0$ and the received observations are $(y_n, x_n)$, $n = 0, 1, 2, \ldots$. To this end, let us denote the input matrix at time $n$ as

$$X_n^T = [x_0, x_1, \ldots, x_n].$$


Moreover, the cost function in Eq. (6.1) is modified to involve a forgetting factor, $0 < \beta \le 1$. The purpose of its presence is to help the cost function slowly forget past data samples by weighting the more recent observations more heavily. This will equip the algorithm with the ability to track changes that occur in the underlying data statistics. Moreover, since we are interested in time-recursive solutions, starting from time $n = 0$, we are forced to introduce regularization. During the initial period, corresponding to time instants $n < l - 1$, the corresponding system of equations will be underdetermined and $X_n^T X_n$ is not invertible. Indeed, we have

$$X_n^T X_n = \sum_{i=0}^{n} x_i x_i^T.$$

In other words, $X_n^T X_n$ is the sum of rank one matrices. Hence, for $n < l - 1$ its rank is necessarily less than $l$, and it cannot be inverted (see Appendix A). For larger values of $n$, it can become full rank, provided that at least $l$ of the input vectors are linearly independent, which is usually assumed to be the case. The previous arguments lead to the following modification of the “conventional” least-squares, known as the exponentially weighted sum of squared errors cost function, minimized by

$$\theta_n = \arg\min_{\theta} \left( \sum_{i=0}^{n} \beta^{n-i} \left(y_i - \theta^T x_i\right)^2 + \lambda \beta^{n+1} \|\theta\|^2 \right), \tag{6.38}$$


where $\beta$ is a user-defined parameter very close to unity, for example, $\beta = 0.999$. In this way, the more recent samples are weighted more heavily than the older ones. Note that the regularizing parameter has been made time-varying. This is because for large values of $n$, no regularization is required. Indeed, for $n > l$, matrix $X_n^T X_n$ becomes, in general, invertible. Moreover, recall from Chapter 3 that the use of regularization also takes precautions for overfitting. However, for very large values of $n \gg l$, this is not a problem, and one wishes to get rid of the imposed bias. The parameter $\lambda > 0$ is also a user-defined variable and its choice will be discussed later on.

Minimizing (6.38) results in

$$\Phi_n \theta_n = p_n, \tag{6.39}$$

where

$$\Phi_n = \sum_{i=0}^{n} \beta^{n-i} x_i x_i^T + \lambda \beta^{n+1} I \tag{6.40}$$

and

$$p_n = \sum_{i=0}^{n} \beta^{n-i} x_i y_i, \tag{6.41}$$

which for $\beta = 1$ coincides with the ridge regression.


TIME-ITERATIVE COMPUTATIONS

By the respective definitions, we have

$$\Phi_n = \beta \Phi_{n-1} + x_n x_n^T, \tag{6.42}$$

and

$$p_n = \beta p_{n-1} + x_n y_n. \tag{6.43}$$

Recall Woodbury’s matrix inversion formula (Appendix A.1),

$$(A + BD^{-1}C)^{-1} = A^{-1} - A^{-1}B(D + CA^{-1}B)^{-1}CA^{-1}.$$

Plugging it in (6.42), after the appropriate inversion and substitutions we obtain

$$\Phi_n^{-1} = \beta^{-1}\Phi_{n-1}^{-1} - \beta^{-1} k_n x_n^T \Phi_{n-1}^{-1}, \tag{6.44}$$

$$k_n = \frac{\beta^{-1}\Phi_{n-1}^{-1} x_n}{1 + \beta^{-1} x_n^T \Phi_{n-1}^{-1} x_n}. \tag{6.45}$$

The term $k_n$ is known as the Kalman gain. For notational convenience, define

$$P_n = \Phi_n^{-1}.$$

Also, rearranging the terms in (6.45), we get

$$k_n = \left(\beta^{-1} P_{n-1} - \beta^{-1} k_n x_n^T P_{n-1}\right) x_n,$$

and taking into account (6.44) results in

$$k_n = P_n x_n. \tag{6.46}$$

TIME UPDATING OF THE PARAMETERS

From (6.39) and (6.43)-(6.45) we obtain

$$\theta_n = \left(\beta^{-1} P_{n-1} - \beta^{-1} k_n x_n^T P_{n-1}\right) \beta p_{n-1} + P_n x_n y_n = \theta_{n-1} - k_n x_n^T \theta_{n-1} + k_n y_n,$$

and finally,

$$\theta_n = \theta_{n-1} + k_n e_n, \tag{6.47}$$

where

$$e_n := y_n - \theta_{n-1}^T x_n. \tag{6.48}$$

The derived algorithm is summarized in Algorithm 6.1.

Note that the basic recursive update of the vector of the parameters follows the same rationale as the LMS and the gradient descent schemes that were discussed in Chapter 5. The updated estimate of the parameters $\theta_n$ at time $n$ equals that of the previous time instant, $n - 1$, plus a correction term that is proportional to the error $e_n$. As a matter of fact, this is the generic scheme that we are going to meet for all the recursive algorithms in this book, including those used for the very “trendy” case of neural networks. The main difference from algorithm to algorithm lies in how one computes the multiplicative factor for the error. In the case of the LMS, this factor was equal to $\mu x_n$. For the case of the RLS, this factor is $k_n$. As we shall soon see, the RLS algorithm is closely related to an alternative to the gradient descent family of optimization algorithms, known as Newton’s iterative optimization of cost functions.


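Algorithm 6.1 (The RLS algorithm).

• Initialize

- $\theta_{-1} = 0$; any other value is also possible.

- $P_{-1} = \lambda^{-1} I$; $\lambda > 0$ a user-defined variable.

- Select $\beta$; close to 1.

• For $n = 0, 1, \ldots$, Do

- $e_n = y_n - \theta_{n-1}^T x_n$

- $z_n = P_{n-1} x_n$

- $k_n = \dfrac{z_n}{\beta + x_n^T z_n}$

- $\theta_n = \theta_{n-1} + k_n e_n$

- $P_n = \beta^{-1} P_{n-1} - \beta^{-1} k_n z_n^T$

• End For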
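Algorithm 6.1 can be sketched in a few lines of NumPy. The function name `rls`, the synthetic data, and the parameter values below are illustrative assumptions rather than part of the text; the batch solution of (6.39)-(6.41) is computed alongside so that the two can be compared.

```python
import numpy as np

def rls(X, y, beta=0.99, lam=0.1):
    """Run the RLS recursion over the rows x_n^T of X; return final theta and P."""
    l = X.shape[1]
    theta = np.zeros(l)                  # theta_{-1} = 0
    P = np.eye(l) / lam                  # P_{-1} = lambda^{-1} I
    for x_n, y_n in zip(X, y):
        e = y_n - theta @ x_n            # a priori error e_n
        z = P @ x_n                      # z_n = P_{n-1} x_n
        k = z / (beta + x_n @ z)         # Kalman gain k_n
        theta = theta + k * e            # parameter update (6.47)
        P = (P - np.outer(k, z)) / beta  # update of the inverse matrix
    return theta, P

rng = np.random.default_rng(5)
N, l, beta, lam = 200, 3, 0.99, 0.1
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((N, l))
y = X @ theta_true + 0.01 * rng.standard_normal(N)

theta_rls, P_rls = rls(X, y, beta, lam)

# Batch solution of (6.39)-(6.41) at the last time index, for comparison.
n = N - 1
w = beta ** (n - np.arange(N))           # weights beta^{n-i}, i = 0, ..., n
Phi = (X * w[:, None]).T @ X + lam * beta ** (n + 1) * np.eye(l)
p = X.T @ (w * y)
theta_batch = np.linalg.solve(Phi, p)
```

Each iteration involves only matrix-vector products and one outer product, so no matrix is ever inverted explicitly. In exact arithmetic the recursion reproduces the batch solution at every time instant; in this sketch the two agree up to rounding errors.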


Remarks 6.2. 

• The complexity of the RLS algorithm is of the order $O(l^2)$ per iteration, due to the matrix-product operations. That is, there is an order of magnitude difference compared to the LMS and the other schemes that were discussed in Chapter 5. In other words, RLS does not scale well with dimensionality.

• The RLS algorithm shares similar numerical behavior with the Kalman filter, which was discussed in Section 4.10; $P_n$ may lose its positive definite and symmetric nature, which then leads the algorithm to divergence. To remedy such a tendency, symmetry preserving versions of the RLS algorithm have been derived; see [65,68]. Note that the use of $\beta < 1$ has a beneficial effect on the error propagation [30,34]. In [58], it is shown that for $\beta = 1$ the error propagation mechanism is of a random walk type, and hence the algorithm is unstable. In [5], it is pointed out that due to numerical errors the term $\frac{1}{\beta + x_n^T P_{n-1} x_n}$ may become negative, leading to divergence. The numerical performance of the RLS becomes a more serious concern in implementations using limited precision, such as fixed point arithmetic. Compared to the LMS, RLS would require the use of higher precision implementations; otherwise, divergence may occur after a few iteration steps. This adds further to its computational disadvantage compared to the LMS.

• The choice of $\lambda$ in the initialization step has been considered in [46]. The related theoretical analysis suggests that $\lambda$ has a direct influence on the convergence speed, and it should be chosen so as to be a small positive constant for high signal-to-noise ratios (SNRs) and a large positive constant for low SNRs.

• In [56], it has been shown that the RLS algorithm can be obtained as a special case of the Kalman filter.

• The main advantage of the RLS is that it converges to the steady state much faster than the LMS and the rest of the members of the LMS family. This can be justified by the fact that the RLS can be seen as an offspring of Newton’s iterative optimization method.

• Distributed versions of the RLS have been proposed in [8,39,40].


6.7 NEWTON’S ITERATIVE MINIMIZATION METHOD

The gradient descent formulation was presented in Chapter 5. It was noted that it exhibits a linear 
convergence rate and a heavy dependence on the condition number of the Hessian matrix associated 
with the cost function. Newton’s method is a way to overcome this dependence on the condition number 
and at the same time improve upon the rate of convergence toward the solution. 

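In Section 5.2, a first-order Taylor expansion was used around the current value $J(\theta^{(i-1)})$. Let us now consider a second-order Taylor expansion (assume $\mu_i = 1$) that involves second-order derivatives,

$$J\left(\theta^{(i-1)} + \Delta\theta^{(i)}\right) \approx J\left(\theta^{(i-1)}\right) + \left(\nabla J\left(\theta^{(i-1)}\right)\right)^T \Delta\theta^{(i)} + \frac{1}{2} \left(\Delta\theta^{(i)}\right)^T \nabla^2 J\left(\theta^{(i-1)}\right) \Delta\theta^{(i)}.$$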

Recall that the second derivative, $\nabla^2 J$, being the derivative of the gradient vector, is an $l \times l$ matrix (see Appendix A). Assuming $\nabla^2 J(\theta^{(i-1)})$ to be positive definite (this is always the case if $J(\theta)$ is a strictly convex function$^5$), the above turns out to be a convex quadratic function with respect to the step $\Delta\theta^{(i)}$; the latter is computed so as to minimize the above second-order approximation. Due to the convexity, there is a single minimum, which is obtained by equating the corresponding gradient to $0$, which results in

$$\Delta\theta^{(i)} = -\left(\nabla^2 J\left(\theta^{(i-1)}\right)\right)^{-1} \nabla J\left(\theta^{(i-1)}\right). \tag{6.49}$$

5 See Chapter 8 for related definitions.

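Note that this is indeed a descent direction, because

$$\nabla^T J\left(\theta^{(i-1)}\right) \Delta\theta^{(i)} = -\nabla^T J\left(\theta^{(i-1)}\right) \left(\nabla^2 J\left(\theta^{(i-1)}\right)\right)^{-1} \nabla J\left(\theta^{(i-1)}\right) < 0,$$

due to the positive definite nature of the Hessian$^6$; equality to zero is achieved only at a minimum, where the gradient becomes zero. Thus, the iterative scheme takes the following form:

$$\theta^{(i)} = \theta^{(i-1)} - \mu_i \left(\nabla^2 J\left(\theta^{(i-1)}\right)\right)^{-1} \nabla J\left(\theta^{(i-1)}\right) : \quad \text{Newton’s iterative scheme.} \tag{6.50}$$

Fig. 6.5 illustrates the method. Note that if the cost function is quadratic, then the minimum is achieved in one iteration.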

As is apparent from the derived recursion, the difference between Newton’s schemes and the gradient descent family of algorithms lies in the step size. This is no longer a scalar, i.e., $\mu_i$. The step size involves the inverse of a matrix, namely, of the second derivative of the cost function with respect to the parameters’ vector. Here lies the power and at the same time the drawback of this type of algorithm. The use of the second derivative provides extra information concerning the local shape of the cost (see below) and as a consequence leads to faster convergence. At the same time, it increases substantially the computational complexity. Also, matrix inversions always need careful treatment. The associated matrix may become ill-conditioned, its determinant may take small values, and in such cases numerical stability issues have to be carefully considered.

FIGURE 6.5

According to Newton’s method, a local quadratic approximation of the cost function is considered (red curve), and the correction pushes the new estimate toward the minimum of this approximation. If the cost function is quadratic, then convergence can be achieved in one step.

FIGURE 6.6

The graphs of the unit Euclidean (black circle) and quadratic (red ellipse) norms centered at $\theta^{(i-1)}$ are shown. In both cases, the goal is to move as far as possible in the direction of $-\nabla J(\theta^{(i-1)})$, while remaining at the ellipse (circle). The result is different for the two cases. The Euclidean norm corresponds to the gradient descent and the quadratic norm to Newton’s method.

6 Recall from Appendix A that if $A$ is positive definite, then $x^T A x > 0$, $\forall x \neq 0$.

Observe that in the case of Newton’s algorithm, the correction direction is not that of 180° with respect to $\nabla J(\theta^{(i-1)})$, as is the case for the gradient descent method. An alternative point of view is to look at (6.50) as the steepest descent direction under the following norm (see Section 5.2):

$$\|v\|_P = (v^T P v)^{1/2},$$

where $P$ is a symmetric positive definite matrix. For our case, we set

$$P = \nabla^2 J\left(\theta^{(i-1)}\right).$$

Then searching for the respective normalized steepest descent direction, i.e.,

$$v = \arg\min_{z} z^T \nabla J\left(\theta^{(i-1)}\right), \quad \text{s.t.} \quad \|z\|_P = 1,$$

results in the normalized vector pointing in the same direction as the one in (6.49) (Problem 6.15). For $P = I$, the gradient descent algorithm results. The geometry is illustrated in Fig. 6.6. Note that Newton’s direction accounts for the local shape of the cost function.

The convergence rate for Newton’s method is, in general, high and it becomes quadratic close to the solution. Assuming $\theta_*$ to be the minimum, quadratic convergence means that at each iteration $i$, the deviation from the optimum value follows the following pattern:

$$\ln \ln \frac{1}{\|\theta^{(i)} - \theta_*\|^2} \propto i : \quad \text{quadratic convergence rate.} \tag{6.51}$$

In contrast, for the linear convergence, the iterations approach the optimal according to

$$\ln \frac{1}{\|\theta^{(i)} - \theta_*\|^2} \propto i : \quad \text{linear convergence rate.} \tag{6.52}$$

Furthermore, the presence of the Hessian in the correction term remedies, to a large extent, the influence of the condition number of the Hessian matrix on the convergence [6] (Problem 6.16).
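The effect of the condition number can be seen on a toy quadratic cost. The sketch below assumes NumPy; the matrix `A`, the vector `b`, the starting point, and the step size are all made-up values chosen so that the Hessian is ill-conditioned.

```python
import numpy as np

# Quadratic cost J(theta) = 0.5 theta^T A theta - b^T theta,
# whose Hessian A has condition number 100.
A = np.array([[100.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)       # the minimizer

def grad(theta):
    return A @ theta - b

def newton_step(theta, mu=1.0):
    # (6.50); a linear system is solved instead of forming the inverse Hessian.
    return theta - mu * np.linalg.solve(A, grad(theta))

def gradient_descent(theta, mu, n_iter):
    for _ in range(n_iter):
        theta = theta - mu * grad(theta)
    return theta

theta0 = np.array([5.0, -3.0])
theta_newton = newton_step(theta0)                         # a single Newton step
theta_gd = gradient_descent(theta0, mu=0.01, n_iter=100)   # mu must stay below 2/100

# Newton's correction is a descent direction: its inner product with the gradient is negative.
delta = -np.linalg.solve(A, grad(theta0))
slope = grad(theta0) @ delta
```

For this quadratic cost a single Newton step lands on `theta_star`, whereas gradient descent, whose step size is limited by the largest eigenvalue, is still far from the minimum along the flat direction after 100 iterations.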


6.7.1 RLS AND NEWTON’S METHOD

The RLS algorithm can be rederived following Newton’s iterative scheme applied to the MSE and adopting stochastic approximation arguments. Let

$$J(\theta) = \frac{1}{2} E\left[(y - \theta^T x)^2\right] = \frac{1}{2}\sigma_y^2 + \frac{1}{2}\theta^T \Sigma_x \theta - \theta^T p,$$

or

$$-\nabla J(\theta) = p - \Sigma_x \theta = E[xy] - E[xx^T]\theta = E[x(y - x^T\theta)] = E[xe],$$

and

$$\nabla^2 J(\theta) = \Sigma_x.$$

Newton’s iteration becomes

$$\theta^{(i)} = \theta^{(i-1)} + \mu_i \Sigma_x^{-1} E[xe].$$

Following stochastic approximation arguments and replacing iteration steps with time updates and expectations with observations, we obtain

$$\theta_n = \theta_{n-1} + \mu_n \Sigma_x^{-1} x_n e_n.$$

Let us now adopt the approximation,

$$\Sigma_x \simeq \frac{1}{n+1} \Phi_n = \frac{1}{n+1} \left( \lambda\beta^{n+1} I + \sum_{i=0}^{n} \beta^{n-i} x_i x_i^T \right),$$

and set

$$\mu_n = \frac{1}{n+1}.$$

Then

$$\theta_n = \theta_{n-1} + k_n e_n,$$

with

$$k_n = P_n x_n,$$

where

$$P_n = \left( \sum_{i=0}^{n} \beta^{n-i} x_i x_i^T + \lambda\beta^{n+1} I \right)^{-1},$$

which then, by using similar steps as for (6.42)-(6.44), leads to the RLS scheme.

Note that this point of view explains the fast converging properties of the RLS and its relative 
insensitivity to the condition number of the covariance matrix. 

Remarks 6.3. 

• In Section 5.5.2, it was pointed out that the LMS is optimal with respect to a min/max robust- 
ness criterion. However, this is not true for the RLS. It turns out that while LMS exhibits the best 
worst-case performance, the RLS is expected to have better performance on average [23]. 



6.8 STEADY-STATE PERFORMANCE OF THE RLS 

Compared to the stochastic gradient techniques, which were considered in Chapter 5, we do not have to worry whether RLS converges and where it converges. The RLS computes the exact solution of the minimization task in (6.38) in an iterative way. Asymptotically and for $\beta = 1$, $\lambda = 0$ RLS solves the MSE optimization task. However, we have to consider its steady-state performance for $\beta \neq 1$. Even for the stationary case, $\beta \neq 1$ results in an excess mean-square error. Moreover, it is important to get a feeling of its tracking performance in time-varying environments. To this end, we adopt the same setting as the one followed in Section 5.12. We will not provide all the details of the proof, because this follows similar steps as in the LMS case. We will point out where differences arise and state the results. For the detailed derivation, the interested reader may consult [15,48,57]; in the latter one, the energy conservation theory is employed.

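As in Chapter 5, we adopt the following models:

$$y_n = \theta_{o,n-1}^T x_n + \eta_n \tag{6.53}$$

and

$$\theta_{o,n} = \theta_{o,n-1} + \omega_n, \tag{6.54}$$

with

$$E[\omega_n \omega_n^T] = \Sigma_\omega.$$

Hence, taking into account (6.53), (6.54), and the RLS iteration involving the respective random variables, we get

$$\theta_n - \theta_{o,n} = \theta_{n-1} + k_n e_n - \theta_{o,n-1} - \omega_n,$$

or

$$c_n := \theta_n - \theta_{o,n} = c_{n-1} + P_n x_n e_n - \omega_n = \left(I - P_n x_n x_n^T\right) c_{n-1} + P_n x_n \eta_n - \omega_n,$$

which is the counterpart of (5.76). Note that the time indices for the input and the noise variables can be omitted, because their statistics are assumed to be time-invariant.

Table 6.1 The steady-state excess MSE, for small values of $\mu$ and $1 - \beta$.

Algorithm    Excess MSE, $J_{exc}$, at steady state
LMS          $\frac{1}{2}\mu\sigma_\eta^2\, \mathrm{trace}\{\Sigma_x\} + \frac{1}{2}\mu^{-1}\, \mathrm{trace}\{\Sigma_\omega\}$
APA          $\frac{1}{2}\mu\sigma_\eta^2\, \mathrm{trace}\{\Sigma_x\}\, E\left[\frac{q}{\|x\|^2}\right] + \frac{1}{2}\mu^{-1}\, \mathrm{trace}\{\Sigma_x\}\, \mathrm{trace}\{\Sigma_\omega\}$
RLS          $\frac{1}{2}(1 - \beta)\sigma_\eta^2\, l + \frac{1}{2}(1 - \beta)^{-1}\, \mathrm{trace}\{\Sigma_\omega \Sigma_x\}$

For $q = 1$, the normalized LMS results. Under a Gaussian input assumption and for long system orders, $l$, in the APA, $E\left[\frac{q}{\|x\|^2}\right] \simeq \frac{q}{\sigma_x^2 (l - 2)}$.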

We adopt the same assumptions as in Section 5.12. In addition, we assume that $P_n$ is changing slowly compared to $c_n$. Hence, every time $P_n$ appears inside an expectation, it is substituted by its mean $E[P_n]$, i.e.,

$$E[P_n] = E[\Phi_n^{-1}],$$

where

$$\Phi_n = \lambda\beta^{n+1} I + \sum_{i=0}^{n} \beta^{n-i} x_i x_i^T$$

and

$$E[\Phi_n] = \lambda\beta^{n+1} I + \frac{1 - \beta^{n+1}}{1 - \beta} \Sigma_x.$$

Assuming $\beta \simeq 1$, the variance at the steady state of $\Phi_n$ can be considered small and we can adopt the following approximation:

$$E[P_n] \simeq \left[E[\Phi_n]\right]^{-1} = \left[\lambda\beta^{n+1} I + \frac{1 - \beta^{n+1}}{1 - \beta} \Sigma_x\right]^{-1}.$$

Based on all the previously stated assumptions, repeating carefully the same steps as in Section 5.12, we end up with the result shown in Table 6.1, which holds for small values of $1 - \beta$. For comparison reasons, the excess MSE is shown together with the values obtained for the LMS as well as the APA algorithms. In stationary environments, one simply sets $\Sigma_\omega = 0$.

According to Table 6.1, the following remarks are in order.

Remarks 6.4.

• For stationary environments, the performance of the RLS is independent of $\Sigma_x$. Of course, if one knows that the environment is stationary, then ideally $\beta = 1$ should be the choice. Yet recall that for $\beta = 1$, the algorithm has stability problems.

• Note that for small $\mu$ and $\beta \simeq 1$, there is an “equivalence” of $\mu \simeq 1 - \beta$, for the two parameters in the LMS and RLS. That is, larger values of $\mu$ are beneficial to the tracking performance of LMS, while smaller values of $\beta$ are required for faster tracking; this is expected because the algorithm forgets the past.

• It is clear from Table 6.1 that an algorithm may converge to the steady state quickly, but it may not necessarily track fast. It all depends on the specific scenario. For example, under the modeling assumptions associated with Table 6.1, the optimal value $\mu_{opt}$ for the LMS (Section 5.12) is given by

$$\mu_{opt} = \sqrt{\frac{\mathrm{trace}\{\Sigma_\omega\}}{\sigma_\eta^2\, \mathrm{trace}\{\Sigma_x\}}},$$

which corresponds to

$$J_{min}^{LMS} = \sqrt{\sigma_\eta^2\, \mathrm{trace}\{\Sigma_x\}\, \mathrm{trace}\{\Sigma_\omega\}}.$$

Optimizing with respect to $\beta$ for the RLS, it is easily shown that

$$\beta_{opt} = 1 - \sqrt{\frac{\mathrm{trace}\{\Sigma_\omega \Sigma_x\}}{\sigma_\eta^2\, l}},$$

$$J_{min}^{RLS} = \sqrt{\sigma_\eta^2\, l\, \mathrm{trace}\{\Sigma_\omega \Sigma_x\}}.$$

Hence, the ratio

$$\frac{J_{min}^{LMS}}{J_{min}^{RLS}} = \sqrt{\frac{\mathrm{trace}\{\Sigma_x\}\, \mathrm{trace}\{\Sigma_\omega\}}{l\, \mathrm{trace}\{\Sigma_\omega \Sigma_x\}}}$$

depends on $\Sigma_\omega$ and $\Sigma_x$. Sometimes LMS tracks better, yet in other problems RLS is the winner. Having said that, it must be pointed out that the RLS always converges faster, and the difference in the rate, compared to the LMS, increases with the condition number of the input covariance matrix.
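The optimal parameter values quoted in the last remark follow from minimizing the LMS and RLS expressions of Table 6.1, and this can be checked numerically. The sketch below assumes NumPy; the covariance matrices, the noise variance, and the function names are hypothetical values and identifiers introduced only for illustration.

```python
import numpy as np

def excess_mse_lms(mu, sigma2, Sx, Sw):
    # LMS entry of Table 6.1.
    return 0.5 * mu * sigma2 * np.trace(Sx) + 0.5 * np.trace(Sw) / mu

def excess_mse_rls(beta, sigma2, Sx, Sw):
    # RLS entry of Table 6.1.
    l = Sx.shape[0]
    return 0.5 * (1 - beta) * sigma2 * l + 0.5 * np.trace(Sw @ Sx) / (1 - beta)

sigma2 = 0.01                              # noise variance sigma_eta^2 (assumed)
Sx = np.diag([4.0, 1.0, 0.25])             # input covariance (assumed)
Sw = 1e-6 * np.diag([1.0, 2.0, 3.0])       # covariance of the model drift (assumed)
l = Sx.shape[0]

mu_opt = np.sqrt(np.trace(Sw) / (sigma2 * np.trace(Sx)))
beta_opt = 1 - np.sqrt(np.trace(Sw @ Sx) / (sigma2 * l))
J_lms = np.sqrt(sigma2 * np.trace(Sx) * np.trace(Sw))
J_rls = np.sqrt(sigma2 * l * np.trace(Sw @ Sx))
ratio = J_lms / J_rls                      # larger than 1 means RLS tracks better here

# Grid around the optimum, to confirm numerically that mu_opt is a minimizer.
mus = np.linspace(0.5 * mu_opt, 2 * mu_opt, 201)
J_grid = excess_mse_lms(mus, sigma2, Sx, Sw)
```

Changing `Sx` and `Sw` moves `ratio` above or below one, which is the point of the remark: which of the two algorithms tracks better depends on how the drift covariance is aligned with the input covariance.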


6.9 COMPLEX-VALUED DATA: THE WIDELY LINEAR RLS 

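Following similar arguments as in Section 5.7, let

$$\varphi = \begin{bmatrix} \theta \\ v \end{bmatrix}, \quad \tilde{x}_n = \begin{bmatrix} x_n \\ x_n^* \end{bmatrix},$$

with

$$\hat{y}_n = \varphi^H \tilde{x}_n.$$

The associated regularized cost becomes

$$J(\varphi) = \sum_{i=0}^{n} \beta^{n-i} (y_i - \varphi^H \tilde{x}_i)(y_i - \varphi^H \tilde{x}_i)^* + \lambda\beta^{n+1} \varphi^H \varphi,$$

or

$$J(\varphi) = \sum_{i=0}^{n} \beta^{n-i} |y_i|^2 + \sum_{i=0}^{n} \beta^{n-i} \varphi^H \tilde{x}_i \tilde{x}_i^H \varphi - \sum_{i=0}^{n} \beta^{n-i} y_i \tilde{x}_i^H \varphi - \sum_{i=0}^{n} \beta^{n-i} \varphi^H \tilde{x}_i y_i^* + \lambda\beta^{n+1} \varphi^H \varphi.$$

Taking the gradient with respect to $\varphi^*$ and equating to zero, we obtain

$$\Phi_n \varphi_n = p_n : \quad \text{widely linear LS estimate,} \tag{6.55}$$

where

$$\Phi_n = \beta^{n+1}\lambda I + \sum_{i=0}^{n} \beta^{n-i} \tilde{x}_i \tilde{x}_i^H, \tag{6.56}$$

$$p_n = \sum_{i=0}^{n} \beta^{n-i} \tilde{x}_i y_i^*. \tag{6.57}$$

Following similar steps as for the real-valued RLS, Algorithm 6.2 results, where $P_n := \Phi_n^{-1}$.

Algorithm 6.2 (The widely linear RLS algorithm).

• Initialize

- $\varphi_{-1} = 0$

- $P_{-1} = \lambda^{-1} I$

- Select $\beta$

• For $n = 0, 1, 2, \ldots$, Do

- $e_n = y_n - \varphi_{n-1}^H \tilde{x}_n$

- $z_n = P_{n-1} \tilde{x}_n$

- $k_n = \dfrac{z_n}{\beta + \tilde{x}_n^H z_n}$

- $\varphi_n = \varphi_{n-1} + k_n e_n^*$

- $P_n = \beta^{-1} P_{n-1} - \beta^{-1} k_n z_n^H$

• End For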
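A possible NumPy rendering of the widely linear RLS recursion is sketched below. The function name `widely_linear_rls`, the synthetic widely linear model, and the parameter values are assumptions made for illustration; the recursion is initialized with a zero parameter vector and $P = \lambda^{-1} I$, and the batch solution of (6.55)-(6.57) is computed for comparison.

```python
import numpy as np

def widely_linear_rls(X, y, beta=0.995, lam=0.1):
    """Widely linear RLS on augmented inputs [x_n; x_n^*]; rows of X are x_n^T."""
    X_aug = np.hstack([X, X.conj()])
    d = X_aug.shape[1]
    phi = np.zeros(d, dtype=complex)            # initial parameter vector
    P = np.eye(d, dtype=complex) / lam          # initial P = lambda^{-1} I
    for x_n, y_n in zip(X_aug, y):
        e = y_n - np.vdot(phi, x_n)             # e_n = y_n - phi^H x_n
        z = P @ x_n
        k = z / (beta + np.vdot(x_n, z).real)   # x_n^H z is real up to rounding
        phi = phi + k * np.conj(e)              # update uses the conjugate error
        P = (P - np.outer(k, z.conj())) / beta  # P_n = (P_{n-1} - k_n z_n^H) / beta
    return phi

rng = np.random.default_rng(6)
N, l, beta, lam = 300, 2, 0.995, 0.1
X = rng.standard_normal((N, l)) + 1j * rng.standard_normal((N, l))
phi_true = np.array([1.0 + 0.5j, -0.3j, 0.2, 0.4 - 0.1j])   # stacked [theta; v]
X_aug = np.hstack([X, X.conj()])
noise = 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = X_aug @ phi_true.conj() + noise             # y_n = phi^H x~_n + noise

phi_wl = widely_linear_rls(X, y, beta, lam)

# Batch solution of (6.55)-(6.57) at the last time index.
n = N - 1
w = beta ** (n - np.arange(N))
Phi = (X_aug * w[:, None]).T @ X_aug.conj() + lam * beta ** (n + 1) * np.eye(2 * l)
p = X_aug.T @ (w * y.conj())
phi_batch = np.linalg.solve(Phi, p)
```

Note that `np.vdot` conjugates its first argument, which is what the Hermitian inner products in the algorithm require, whereas `np.outer` does not conjugate, so the conjugation of `z` is written explicitly.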


Setting $v_n = 0$ and replacing $\tilde{x}_n$ with $x_n$ and $\varphi_n$ with $\theta_n$, the linear complex-valued RLS results.

6.10 COMPUTATIONAL ASPECTS OF THE LS SOLUTION

The literature concerning the efficient solution of the LS method as well as the computationally efficient implementation of the RLS is huge. In this section, we will only highlight some of the basic directions that have been followed over the years. Most of the available software packages implement such efficient schemes.

A major direction in developing various algorithmic schemes was to cope with the numerical stability issues, as we have already discussed in Remarks 6.2. The main concern is to guarantee the symmetry and positive definiteness of $\Phi_n$. The path followed toward this end is to work with the square root factors of $\Phi_n$.

CHOLESKY FACTORIZATION 

It is known from linear algebra that every positive definite symmetric matrix, such as $\Phi_n$, accepts the
following factorization:

$$\Phi_n = L_n L_n^T,$$

where $L_n$ is lower triangular with positive entries along its diagonal. Moreover, this factorization is
unique.

Concerning the LS task, one focuses on updating the factor $L_n$, instead of $\Phi_n$, in order to improve
numerical stability. Computation of the Cholesky factors can be achieved via a modified version of the
Gauss elimination scheme [22].
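As a small illustration of how the Cholesky factor can be used, the following sketch solves the normal equations through two triangular systems. The data are synthetic and the variable names are assumptions made for the example; `np.linalg.solve` is used for brevity, whereas a dedicated triangular solver would normally be preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))       # rows play the role of the input vectors
y = rng.standard_normal(50)
Phi = X.T @ X                          # sample covariance-type matrix (no weighting, no regularization)
p = X.T @ y

L = np.linalg.cholesky(Phi)            # lower triangular factor, Phi = L L^T
# Solve Phi theta = p in two steps: L w = p, then L^T theta = w.
w = np.linalg.solve(L, p)
theta = np.linalg.solve(L.T, w)
```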

QR FACTORIZATION 

A better option for computing square root factors of a matrix, from a numerical stability point of view, is
via the QR decomposition method. To simplify the discussion, let us consider $\beta = 1$ and $\lambda = 0$ (no
regularization). Then the positive definite (sample) covariance matrix can be factored as

$$\Phi_n = U_n^T U_n, \quad U_n := [\boldsymbol{x}_0, \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n]^T.$$

From linear algebra [22], we know that the $(n+1) \times l$ matrix $U_n$ can be written as a product

$$U_n = Q_n R_n,$$

where $Q_n$ is an $(n+1) \times (n+1)$ orthogonal matrix and $R_n$ is an $(n+1) \times l$ upper triangular matrix.
Note that $R_n$ is related to the Cholesky factor $L_n^T$. It turns out that working with the QR factors of $U_n$
is preferable, with respect to numerical stability, to working on the Cholesky factorization of $\Phi_n$. QR
factorization can be achieved via different paths:

• Gram-Schmidt orthogonalization of the input matrix columns. We have seen this path in Chapter 4 
while discussing the lattice-ladder algorithm for solving the normal equations for the filtering case. 
Under the time shift property of the input signal, lattice-ladder-type algorithms have also been 
developed for the LS filtering task [31,32]. 

• Givens rotations: This has also been a popular line [10,41,52,54].

• Householder reflections: This line has been followed in [53,55]. The use of Householder reflections
leads to a particularly robust scheme from a numerical point of view. Moreover, the scheme
presents a high degree of parallelism, which can be exploited appropriately in a parallel processing
environment.

A selection of review papers related to QR factorization is given in [2].
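To make the Givens rotation path concrete, here is a minimal batch sketch (not one of the recursive schemes of the cited references). The function name `givens_qr` and the test matrix are assumptions for illustration. Each plane rotation zeroes one subdiagonal entry, and the top block of the resulting triangular factor coincides with the transposed Cholesky factor of $U^T U$.

```python
import numpy as np

def givens_qr(U):
    """QR factorization by Givens rotations; returns Q, R with U = Q R."""
    m, l = U.shape
    R = U.astype(float).copy()
    Q = np.eye(m)
    for j in range(l):
        for i in range(m - 1, j, -1):            # zero R[i, j] against R[i - 1, j]
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])      # plane rotation acting on rows i - 1 and i
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 3))
Q, R = givens_qr(U)
L = np.linalg.cholesky(U.T @ U)                  # Cholesky factor of the covariance-type matrix
```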

FAST RLS VERSIONS 

Another line of intense activity, especially in the 1980s, was that of exploiting the special structure
associated with the filtering task; that is, the input to the filter comprises the samples from a realization
of a random signal/process. Abiding by our adopted notational convention, the input vector will now
be denoted as $\boldsymbol{u}$ instead of $\boldsymbol{x}$. Also, for the needs of the discussion we will bring into the notation the
order of the filter, $m$. In this case, the input vectors (regressors) at two successive time instants share
all but two of their components. Indeed, for an $m$th-order system, we have

$$\boldsymbol{u}_{m,n} = [u_n, u_{n-1}, \ldots, u_{n-m+1}]^T, \quad \boldsymbol{u}_{m,n-1} = [u_{n-1}, u_{n-2}, \ldots, u_{n-m}]^T,$$

and we can partition the input vector as

$$\boldsymbol{u}_{m,n} = [u_n, \boldsymbol{u}_{m-1,n-1}^T]^T = [\boldsymbol{u}_{m-1,n}^T, u_{n-m+1}]^T.$$

This property is also known as time shift structure. Such a partition of the input vector leads to

$$\Phi_{m,n} = \begin{bmatrix} \sum_{i=0}^{n} u_i^2 & \sum_{i=0}^{n} u_i \boldsymbol{u}_{m-1,i-1}^T \\ \sum_{i=0}^{n} \boldsymbol{u}_{m-1,i-1} u_i & \Phi_{m-1,n-1} \end{bmatrix} = \begin{bmatrix} \Phi_{m-1,n} & \sum_{i=0}^{n} \boldsymbol{u}_{m-1,i} u_{i-m+1} \\ \sum_{i=0}^{n} \boldsymbol{u}_{m-1,i}^T u_{i-m+1} & \sum_{i=0}^{n} u_{i-m+1}^2 \end{bmatrix}, \quad m = 2, 3, \ldots, l, \tag{6.58}$$

where for complex variables transposition is replaced by the Hermitian one. Compare (6.58) with
(4.60). The two partitions look alike, yet they are different. Matrix $\Phi_{m,n}$ is no longer Toeplitz. Its lower
partition is given in terms of $\Phi_{m-1,n-1}$. Such matrices are known as near-to-Toeplitz. All that is needed
is to “correct” $\Phi_{m-1,n-1}$ back to $\Phi_{m-1,n}$ by subtracting a rank one matrix, i.e.,

$$\Phi_{m-1,n-1} = \Phi_{m-1,n} - \boldsymbol{u}_{m-1,n} \boldsymbol{u}_{m-1,n}^T.$$

It turns out that such corrections, although they may slightly complicate the derivation, can still lead to
computationally efficient order recursive schemes, via the application of the matrix inversion lemma, as
was the case in the MSE of Section 4.8. Such schemes have their origin in the pioneering PhD thesis of
Martin Morf at Stanford [42]. Levinson-type, Schur-type, split-Levinson-type, and lattice-ladder
algorithms have been derived for the LS case [3,27,28,43,44,60,61]. Some of the schemes noted previously
under the QR factorization exploit the time shift structure of the input signal.
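The near-to-Toeplitz structure of (6.58) and the rank one correction can be checked numerically. The sketch below assumes prewindowing (input samples with negative time index are taken to be zero); the helper names `regressor` and `phi` are hypothetical.

```python
import numpy as np

def regressor(u, m, i):
    """u_{m,i} = [u_i, ..., u_{i-m+1}]^T, with u_j = 0 for j < 0 (assumed prewindowing)."""
    return np.array([u[i - j] if i - j >= 0 else 0.0 for j in range(m)])

def phi(u, m, n):
    """Phi_{m,n} = sum over i = 0..n of u_{m,i} u_{m,i}^T."""
    return sum(np.outer(regressor(u, m, i), regressor(u, m, i)) for i in range(n + 1))

rng = np.random.default_rng(0)
u = rng.standard_normal(30)
m, n = 4, 20
Phi_mn = phi(u, m, n)

lower_block = Phi_mn[1:, 1:]            # equals Phi_{m-1,n-1}
upper_block = Phi_mn[:-1, :-1]          # equals Phi_{m-1,n}
u_vec = regressor(u, m - 1, n)
corrected = upper_block - np.outer(u_vec, u_vec)   # rank one correction
diagonal_gap = Phi_mn[0, 0] - Phi_mn[1, 1]         # nonzero, so the matrix is not Toeplitz
```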

Besides the order recursive schemes, a number of fixed-order fast RLS-type schemes have been
developed following the work in [33]. Recall from the definition of the Kalman gain in (6.45) that for
an $l$th-order system we have

$$\boldsymbol{k}_{l+1,n} = \Phi_{l+1,n}^{-1} \boldsymbol{u}_{l+1,n} = \begin{bmatrix} * & * \\ * & \Phi_{l,n-1} \end{bmatrix}^{-1} \begin{bmatrix} * \\ \boldsymbol{u}_{l,n-1} \end{bmatrix} = \begin{bmatrix} \Phi_{l,n} & * \\ * & * \end{bmatrix}^{-1} \begin{bmatrix} \boldsymbol{u}_{l,n} \\ * \end{bmatrix},$$

where $*$ denotes any value of the element. Without going into detail, the lower partition can relate the
Kalman gain of order $l$ and time $n-1$ to the Kalman gain of order $l+1$ and time $n$ (step up). Then the
upper partition can be used to obtain the time update Kalman gain at order $l$ and time $n$ (step down).
Such a procedure bypasses the need for matrix operations, leading to $O(l)$ RLS-type algorithms [7,9]
with complexity $7l$ per time update. However, these versions turned out to be numerically unstable.
Numerically stabilized versions, at only a small extra computational cost, were proposed in [5,58].
All the aforementioned schemes have also been developed for solving the (regularized) exponentially
weighted LS cost function.

Besides this line, variants that obtain approximate solutions have been derived in an attempt to
reduce complexity; these schemes use an approximation of the covariance or inverse covariance matrix
[14,38]. The fast Newton transversal filter (FNTF) algorithm [45] approximates the inverse covariance
matrix by a banded matrix of width $p$. Such a modeling has a specific physical interpretation. A banded
inverse covariance matrix corresponds to an AR process of order $p$. Hence, if the input signal can
sufficiently be modeled by an AR model, FNTF obtains a least-squares performance. Moreover, this
performance is obtained at $O(p)$ instead of $O(l)$ computational cost. This can be very effective in
applications where $p \ll l$. This is the case, for example, in audio conferencing, where the input signal
is speech. Speech can efficiently be modeled by an AR of the order of 15, yet the filter order can be of a
few hundred taps [49]. FNTF bridges the gap between LMS ($p = 1$) and (fast) RLS ($p = l$). Moreover,
FNTF builds upon the structure of the stabilized fast RLS. More recently, the banded inverse covariance
matrix approximation has been successfully applied in spectral analysis [21].
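The link between AR processes and banded inverse covariance matrices can be seen on the simplest case. The sketch below is an illustration and not the FNTF algorithm itself: it builds the covariance matrix of a stationary AR(1) process (coefficient `a` and size `l` are arbitrary assumed values) and inspects its inverse, which turns out to be tridiagonal.

```python
import numpy as np

a, l = 0.8, 6                                   # AR(1) coefficient and matrix size (illustrative)
idx = np.arange(l)
lag = np.abs(idx[:, None] - idx[None, :])
# Covariance of a stationary AR(1) process u_n = a u_{n-1} + w_n, with unit-variance white w_n.
R = a ** lag / (1.0 - a ** 2)
R_inv = np.linalg.inv(R)
off_band = R_inv[lag > 1]                       # entries outside the three central diagonals
```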

More on the efficient LS schemes can be found in [15,17,20,24,29,57]. 


6.11 THE COORDINATE AND CYCLIC COORDINATE DESCENT METHODS 

So far, we have discussed the gradient descent and Newton’s method for optimization. We will conclude 
the discussion with a third method, which can also be seen as a member of the steepest descent family 
of methods. Instead of the Euclidean and quadratic norms, let us consider the following minimization 
task for obtaining the normalized descent direction: 

$$\boldsymbol{v} = \arg\min_{\boldsymbol{z}} \boldsymbol{z}^T \nabla J, \tag{6.59}$$

$$\text{s.t.} \quad \|\boldsymbol{z}\|_1 = 1, \tag{6.60}$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, defined as

$$\|\boldsymbol{z}\|_1 := \sum_{i=1}^{l} |z_i|.$$

Most of Chapter 9 is dedicated to this norm and its properties. Observe that this is not differentiable.
Solving the minimization task (Problem 6.17) results in

$$\boldsymbol{v} = -\mathrm{sgn}\left((\nabla J)_k\right) \boldsymbol{e}_k,$$

where $\boldsymbol{e}_k$ is the direction of the coordinate corresponding to the component $(\nabla J)_k$ with the largest
absolute value, i.e.,

$$|(\nabla J)_k| > |(\nabla J)_j|, \quad j \ne k,$$

and sgn(·) is the sign function. The geometry is illustrated in Fig. 6.7. In other words, the descent
direction is along a single basis vector; that is, each time only a single component of $\boldsymbol{\theta}$ is updated. It is
the component that corresponds to the directional derivative, $(\nabla J(\boldsymbol{\theta}^{(i-1)}))_k$, with the largest increase
and the update rule becomes

$$\theta_k^{(i)} = \theta_k^{(i-1)} - \mu_i \frac{\partial J(\boldsymbol{\theta}^{(i-1)})}{\partial \theta_k} : \text{ coordinate descent scheme}, \tag{6.61}$$

$$\theta_j^{(i)} = \theta_j^{(i-1)}, \quad j = 1, 2, \ldots, l, \ j \ne k. \tag{6.62}$$



FIGURE 6.7 

The unit norm $\|\cdot\|_1$ ball centered at $\boldsymbol{\theta}^{(i-1)}$ is a rhombus (in $\mathbb{R}^2$). The direction $\boldsymbol{e}_1$ is the one corresponding to the
largest component of $\nabla J$. Recall that the components of the vector $\nabla J$ are the respective directional derivatives.

Because only one component is updated at each iteration, this greatly simplifies the update mechanism.
The method is known as coordinate descent.
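As an illustration of the update in (6.61) and (6.62), the sketch below applies coordinate descent to an LS cost, picking at each step the coordinate with the largest absolute partial derivative. The function name, the fixed step size, and the data are assumptions made for the example.

```python
import numpy as np

def coordinate_descent_ls(X, y, n_iter=500):
    """Coordinate descent sketch for J(theta) = 0.5 * ||y - X theta||^2."""
    theta = np.zeros(X.shape[1])
    mu = 1.0 / np.max(np.sum(X ** 2, axis=0))   # fixed step size (an assumed, conservative choice)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        k = np.argmax(np.abs(grad))             # coordinate with the largest |partial derivative|
        theta[k] -= mu * grad[k]                # only component k changes; the rest stay fixed
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
theta_cd = coordinate_descent_ls(X, y)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```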

Based on this rationale, a number of variants of the basic coordinate descent have been proposed.
The cyclic coordinate descent (CCD) method in its simplest form entails a cyclic update with respect
to one coordinate per iteration cycle; that is, at the $i$th iteration the following minimization is solved:

$$\theta_k^{(i)} := \arg\min_{\theta} J\left(\theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)}, \theta, \theta_{k+1}^{(i-1)}, \ldots, \theta_l^{(i-1)}\right).$$

In words, all components but $\theta_k$ are assumed constant; those components $\theta_j$, $j < k$, are fixed to their
updated values, $\theta_j^{(i)}$, $j = 1, 2, \ldots, k-1$, and the rest, $\theta_j$, $j = k+1, \ldots, l$, to the available estimates,
$\theta_j^{(i-1)}$, from the previous iteration. The nice feature of such a technique is that a simple closed form
solution for the minimizer may be obtained. A revival of such techniques has happened in the context
of sparse learning models (Chapter 10) [18,67]. Convergence issues of CCD have been considered
in [36,62]. CCD algorithms for the LS task have also been considered in [66] and the references
therein. Besides the basic CCD scheme, variants are also available, using different scenarios for the
choice of the direction to be updated each time, in order to improve convergence, ranging from a random
choice to a change of the coordinate systems, which is known as an adaptive coordinate descent
scheme [35].
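For the LS cost, the one-dimensional minimization in each CCD step has a simple closed form, which the following sketch exploits. It is a plain illustration under assumed names and data, not the specific algorithms of the cited references.

```python
import numpy as np

def cyclic_coordinate_descent_ls(X, y, n_cycles=100):
    """Cyclic coordinate descent sketch for J(theta) = ||y - X theta||^2."""
    l = X.shape[1]
    theta = np.zeros(l)
    residual = y - X @ theta
    col_norms = np.sum(X ** 2, axis=0)
    for _ in range(n_cycles):
        for k in range(l):
            residual += X[:, k] * theta[k]                  # remove the contribution of coordinate k
            theta[k] = X[:, k] @ residual / col_norms[k]    # closed-form 1-D minimizer
            residual -= X[:, k] * theta[k]                  # put it back with the updated value
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
theta_ccd = cyclic_coordinate_descent_ls(X, y)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

Keeping the residual up to date makes each coordinate update cost a single inner product, which is the main practical appeal of the method.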


6.12 SIMULATION EXAMPLES 

In this section, simulation examples are presented concerning the convergence and tracking performance
of the RLS compared to algorithms of the gradient descent family, which have been derived in
Chapter 5.

Example 6.1. The focus of this example is to demonstrate the comparative performance, with respect
to the convergence rate, of the RLS, NLMS, and APA algorithms (the latter two were discussed in
Chapter 5). To this end, we generate data according to the regression model

$$y_n = \boldsymbol{\theta}_o^T \boldsymbol{x}_n + \eta_n,$$

where $\boldsymbol{\theta}_o \in \mathbb{R}^{200}$. Its elements are generated randomly according to the normalized Gaussian. The
noise samples are i.i.d. generated via the zero mean Gaussian with variance equal to $\sigma^2 = 0.01$. The
elements of the input vector are also i.i.d. generated via the normalized Gaussian. Using the generated
samples $(y_n, \boldsymbol{x}_n)$, $n = 0, 1, \ldots$, as the training sequence for all three previously stated algorithms, the
convergence curves of Fig. 6.8 are obtained. The curves show the squared error in dBs ($10\log_{10}(e_n^2)$),
averaged over 100 different realizations of the experiments, as a function of the time index $n$. The
following parameters were used for the involved algorithms. (a) For the NLMS, we used $\mu = 1.2$ and
$\delta = 0.001$; (b) for the APA, we used $\mu = 0.2$, $\delta = 0.001$, and $q = 30$; and (c) for the RLS, we used
$\beta = 1$ and $\lambda = 0.1$. The parameters for the NLMS algorithm and the APA were chosen so that both
algorithms converge to the same error floor. The improved performance of the APA concerning the
convergence rate compared to the NLMS is readily seen. However, both algorithms fall short when
compared to the RLS. Note that the RLS converges to a lower error floor, because no forgetting factor
was used. To be consistent, a forgetting factor $\beta < 1$ should have been used in order for this algorithm
to settle at the same error floor as the other two algorithms; this would have a beneficial effect on
the convergence rate. However, having chosen $\beta = 1$, it is demonstrated that the RLS can converge
really fast, even to lower error floors. On the other hand, this improved performance is obtained at
substantially higher complexity. In case the input vector is part of a random process, and the special
time shift structure can be exploited, as discussed in Section 6.10, the lower-complexity versions are at
the disposal of the designer. A further comparative performance example, including another family of
online algorithms, will be given in Chapter 8.

FIGURE 6.8

MSE curves as a function of the number of iterations for NLMS, APA, and RLS. The RLS converges faster and at
lower error floor.
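A reduced version of this experiment can be sketched as follows. The sketch is an assumption-laden illustration: it uses a single realization and a smaller dimension than in the example, it omits the APA, and the functions `rls` and `nlms` are written from the standard form of these recursions rather than copied from the algorithm listings.

```python
import numpy as np

def rls(X, y, beta=1.0, lam=0.1):
    """Real-valued RLS sketch; returns the final estimate and the a priori errors."""
    l = X.shape[1]
    theta, P = np.zeros(l), np.eye(l) / lam
    err = np.empty(len(y))
    for n, (x, yn) in enumerate(zip(X, y)):
        err[n] = yn - theta @ x
        z = P @ x
        k = z / (beta + x @ z)
        theta = theta + k * err[n]
        P = (P - np.outer(k, z)) / beta
    return theta, err

def nlms(X, y, mu=1.2, delta=0.001):
    """NLMS sketch with step size mu and regularization delta."""
    theta = np.zeros(X.shape[1])
    err = np.empty(len(y))
    for n, (x, yn) in enumerate(zip(X, y)):
        err[n] = yn - theta @ x
        theta = theta + mu / (delta + x @ x) * err[n] * x
    return theta, err

rng = np.random.default_rng(0)
l, N = 20, 500                                   # smaller than l = 200, to keep the sketch fast
theta_o = rng.standard_normal(l)
X = rng.standard_normal((N, l))
y = X @ theta_o + np.sqrt(0.01) * rng.standard_normal(N)
theta_rls, e_rls = rls(X, y)
theta_nlms, e_nlms = nlms(X, y)
dev_rls = np.linalg.norm(theta_rls - theta_o)    # parameter deviation after N iterations
dev_nlms = np.linalg.norm(theta_nlms - theta_o)
```

Plotting `10 * np.log10(e_rls ** 2)` and `10 * np.log10(e_nlms ** 2)`, averaged over several realizations, gives curves of the kind shown in the figure.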

However, it has to be stressed that this notable advantage (between RLS- and LMS-type schemes) in
convergence speed from the initial conditions to steady state may not be the case concerning the tracking
performance, when the algorithms have to track time-varying environments. This is demonstrated
next.

Example 6.2. This example focuses on the comparative tracking performance of the RLS and NLMS.
Our goal is to demonstrate some cases where the RLS fails to do as well as the NLMS. Of course, it
must be kept in mind that according to the theory, the comparative performance is very much dependent
on the specific application.

For the needs of our example, let us mobilize the time-varying model of the parameters given in
(6.54) in its more practical version and generate the data according to the following linear system:

$$y_n = \boldsymbol{x}_n^T \boldsymbol{\theta}_{o,n-1} + \eta_n, \tag{6.63}$$

where

$$\boldsymbol{\theta}_{o,n} = \alpha \boldsymbol{\theta}_{o,n-1} + \boldsymbol{\omega}_n,$$

with $\boldsymbol{\theta}_{o,n} \in \mathbb{R}^5$. It turns out that such a time-varying model is closely related (for the right choice of
the involved parameters) to what is known in communications as a Rayleigh fading channel, if the
parameters comprising $\boldsymbol{\theta}_{o,n}$ are thought to represent the impulse response of such a channel [57].
Rayleigh fading channels are very common and can adequately model a number of transmission channels
in wireless communications. Playing with the parameter $\alpha$ and the variance of the corresponding
noise source, $\boldsymbol{\omega}$, one can achieve fast or slow time-varying scenarios. In our case, we chose $\alpha = 0.97$
and the noise followed a Gaussian distribution of zero mean and covariance matrix $\Sigma_\omega = 0.1 I$.

Concerning the data generation, the input samples were generated i.i.d. from a Gaussian $\mathcal{N}(0, 1)$,
and the noise was also Gaussian of zero mean value and variance equal to $\sigma_\eta^2 = 0.01$. Initialization of
the time-varying model ($\boldsymbol{\theta}_{o,0}$) was randomly done by drawing samples from $\mathcal{N}(0, 1)$.

Fig. 6.9 shows the obtained MSE curve as a function of the iterations for the NLMS and the RLS.
For the RLS, the forgetting factor was set equal to $\beta = 0.995$ and for the NLMS, $\mu = 0.5$ and $\delta = 0.001$.
Such a choice resulted in the best performance, for both algorithms, after extensive experimentation.
The curves are the result of averaging out 200 independent runs.

FIGURE 6.9

For a fast time-varying parameter model, the RLS (gray) fails to track it, in spite of its very fast initial convergence,
compared to the NLMS (red).

Fig. 6.10 shows the resulting curves for medium and slow time-varying channels, corresponding to
$\Sigma_\omega = 0.01 I$ and $\Sigma_\omega = 0.001 I$, respectively.

FIGURE 6.10

MSE curves as a function of iteration for (A) a medium and (B) a slow time-varying parameter model. The red
curve corresponds to the NLMS and the gray one to the RLS.


6.13 TOTAL LEAST-SQUARES 

In this section, the LS task will be formulated from a different perspective. Assume zero mean (centered)
data and our familiar linear regression model, employing the observed samples,

$$\boldsymbol{y} = X \boldsymbol{\theta} + \boldsymbol{\eta},$$

as in Section 6.2. We have seen that the LS task is equivalent to (orthogonally) projecting $\boldsymbol{y}$ onto the
span$\{\boldsymbol{x}_1^c, \ldots, \boldsymbol{x}_l^c\}$ of the columns of $X$, hence making the error

$$\boldsymbol{e} = \boldsymbol{y} - \hat{\boldsymbol{y}}$$

orthogonal to the column space of $X$. Equivalently, this can be written as

$$\text{minimize} \quad \|\boldsymbol{e}\|^2,$$

$$\text{s.t.} \quad \boldsymbol{y} - \boldsymbol{e} \in \mathcal{R}(X), \tag{6.64}$$

where $\mathcal{R}(X)$ is the range space of $X$ (see Remarks 6.1 for the respective definition). Moreover, once
$\hat{\boldsymbol{\theta}}_{LS}$ has been obtained, we can write

$$\hat{\boldsymbol{y}} = X \hat{\boldsymbol{\theta}}_{LS} = \boldsymbol{y} - \boldsymbol{e}$$

or

$$[X : \boldsymbol{y} - \boldsymbol{e}] \begin{bmatrix} \hat{\boldsymbol{\theta}}_{LS} \\ -1 \end{bmatrix} = \boldsymbol{0}, \tag{6.65}$$

where $[X : \boldsymbol{y} - \boldsymbol{e}]$ is the matrix that results after extending $X$ by an extra column, $\boldsymbol{y} - \boldsymbol{e}$. Thus, all the
points $(y_n - e_n, \boldsymbol{x}_n) \in \mathbb{R}^{l+1}$, $n = 1, 2, \ldots, N$, lie on the same hyperplane, crossing the origin, as shown
in Fig. 6.11. In other words, in order to fit a hyperplane to the data, the LS method applies a correction
$e_n$, $n = 1, 2, \ldots, N$, to the output samples. Thus, we have silently assumed that the regressors have
been obtained via exact measurements and that the noise affects only the output observations.
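The statements around (6.64) and (6.65) are easy to verify numerically. The sketch below uses synthetic data and assumed variable names: it checks that the LS error is orthogonal to the columns of the input matrix and that the corrected points lie on a hyperplane through the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 40, 3
X = rng.standard_normal((N, l))
y = X @ rng.standard_normal(l) + 0.1 * rng.standard_normal(N)

theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ theta_ls                               # LS error vector
orthogonality = X.T @ e                            # should vanish: e is orthogonal to the columns of X
augmented = np.column_stack([X, y - e])            # the matrix [X : y - e]
check = augmented @ np.append(theta_ls, -1.0)      # should vanish, cf. (6.65)
```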



FIGURE 6.11 

According to the LS method, only the output points $y_n$ are corrected to $y_n - e_n$, so that the pairs $(y_n - e_n, \boldsymbol{x}_n)$ lie on
a hyperplane, crossing the origin for centered data. If the data are not centered, it crosses the centroid, $(\bar{y}, \bar{\boldsymbol{x}})$.

In this section, the more general case will be considered, where we allow both the input (regressors)
and the output variables to be perturbed by (unobserved) noise samples. Such a treatment has a long
history, dating back to the 19th century [1]. The method remained in obscurity until it was revived 50
years later for two-dimensional models by Deming [13]; it is sometimes known as Deming regression
(see also [19] for a historical overview). Such models are also known as errors-in-variables regression
models.

Our kick-off point is the formulation in (6.64). Let $\boldsymbol{e}$ be the correction vector to be applied on $\boldsymbol{y}$
and $E$ the correction matrix to be applied on $X$. The method of total least-squares (TLS) computes the
unknown parameter vector by solving the following optimization task:

$$\text{minimize} \quad \|[E : \boldsymbol{e}]\|_F,$$

$$\text{s.t.} \quad \boldsymbol{y} - \boldsymbol{e} \in \mathcal{R}(X - E). \tag{6.66}$$

Recall (Remarks 6.1) that the Frobenius norm of a matrix is defined as the square root of the sum of
squares of all its entries, and it is the direct generalization of the Euclidean norm defined for vectors.
Let us first focus on solving the task in (6.66), and we will comment on its geometric interpretation
later on. The set of constraints in (6.66) can equivalently be written as

$$(X - E) \boldsymbol{\theta} = \boldsymbol{y} - \boldsymbol{e}. \tag{6.67}$$

Define

$$F := X - E, \tag{6.68}$$

and let $\boldsymbol{f}_i^T \in \mathbb{R}^l$, $i = 1, 2, \ldots, N$, be the rows of $F$, i.e.,

$$F^T = [\boldsymbol{f}_1, \ldots, \boldsymbol{f}_N],$$

and $\boldsymbol{f}_i^c \in \mathbb{R}^N$, $i = 1, 2, \ldots, l$, the respective columns, i.e.,

$$F = [\boldsymbol{f}_1^c, \ldots, \boldsymbol{f}_l^c].$$

Let also

$$\boldsymbol{g} := \boldsymbol{y} - \boldsymbol{e}. \tag{6.69}$$

Hence, (6.67) can be written in terms of the columns of $F$, i.e.,

$$\theta_1 \boldsymbol{f}_1^c + \cdots + \theta_l \boldsymbol{f}_l^c - \boldsymbol{g} = \boldsymbol{0}. \tag{6.70}$$

Eq. (6.70) implies that the $l + 1$ vectors, $\boldsymbol{f}_1^c, \ldots, \boldsymbol{f}_l^c, \boldsymbol{g} \in \mathbb{R}^N$, are linearly dependent, which in turn
dictates that

$$\mathrm{rank}\{[F : \boldsymbol{g}]\} \le l. \tag{6.71}$$

There is a subtle point here. The opposite is not necessarily true; that is, (6.71) does not necessarily
imply (6.70). If the rank$\{F\} < l$, there is, in general, no $\boldsymbol{\theta}$ to satisfy (6.70). This can easily be verified,
for example, by considering the extreme case where $\boldsymbol{f}_1^c = \boldsymbol{f}_2^c = \cdots = \boldsymbol{f}_l^c$. Keeping that in mind, we
need to impose some extra assumptions.

Assumptions:

1. The $N \times l$ matrix $X$ is full rank. This implies that all its singular values are nonzero, and we can
write (recall (6.14))

$$X = \sum_{i=1}^{l} \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^T,$$

where we have assumed that

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_l > 0. \tag{6.72}$$

2. The $(N \times (l+1))$ matrix $[X : \boldsymbol{y}]$ is also full rank; hence

$$[X : \boldsymbol{y}] = \sum_{i=1}^{l+1} \bar{\sigma}_i \bar{\boldsymbol{u}}_i \bar{\boldsymbol{v}}_i^T,$$

with

$$\bar{\sigma}_1 \ge \bar{\sigma}_2 \ge \cdots \ge \bar{\sigma}_{l+1} > 0. \tag{6.73}$$

3. Assume that

$$\bar{\sigma}_{l+1} < \sigma_l.$$

As we will see soon, this guarantees the existence of a unique solution. If this condition is not valid,
solutions can still exist; however, this corresponds to a degenerate case. Such solutions have been
the subject of study in the related literature [37,64]. We will not deal with such cases here. Note that,
in general, it can be shown that $\bar{\sigma}_{l+1} \le \sigma_l$ [26]. Thus, our assumption demands strict inequality.

4. Assume that

$$\bar{\sigma}_l > \bar{\sigma}_{l+1}.$$

This condition will also be used in order to guarantee uniqueness of the solution.
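We are now ready to solve the following optimization task:

$$\text{minimize}_{F, \boldsymbol{g}} \quad \|[X : \boldsymbol{y}] - [F : \boldsymbol{g}]\|_F^2,$$

$$\text{s.t.} \quad \mathrm{rank}\{[F : \boldsymbol{g}]\} = l. \tag{6.74}$$

In words, compute the best, in the Frobenius norm sense, rank $l$ approximation, $[F : \boldsymbol{g}]$, to the (rank
$l + 1$) matrix $[X : \boldsymbol{y}]$. We know from Remarks 6.1 that

$$[F : \boldsymbol{g}] = \sum_{i=1}^{l} \bar{\sigma}_i \bar{\boldsymbol{u}}_i \bar{\boldsymbol{v}}_i^T, \tag{6.75}$$

and consequently

$$[E : \boldsymbol{e}] = \bar{\sigma}_{l+1} \bar{\boldsymbol{u}}_{l+1} \bar{\boldsymbol{v}}_{l+1}^T, \tag{6.76}$$

with the corresponding Frobenius and spectral norms of the error matrix being equal to

$$\|E : \boldsymbol{e}\|_F = \bar{\sigma}_{l+1} = \|E : \boldsymbol{e}\|_2. \tag{6.77}$$

Note that the above choice is unique, because $\bar{\sigma}_{l+1} < \bar{\sigma}_l$.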
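The rank $l$ approximation in (6.75) and the norms in (6.77) can be reproduced with a standard SVD routine. The data and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 30, 4
X = rng.standard_normal((N, l))
y = X @ rng.standard_normal(l) + 0.1 * rng.standard_normal(N)

A = np.column_stack([X, y])                         # the matrix [X : y], of rank l + 1
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # singular values in decreasing order
Fg = (U[:, :l] * s[:l]) @ Vt[:l]                    # best rank-l approximation [F : g]
Ee = A - Fg                                         # correction [E : e]
frob_norm = np.linalg.norm(Ee, "fro")
spec_norm = np.linalg.norm(Ee, 2)
```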
So far, we have uniquely solved the task in (6.74). However, we still have to recover the estimate
$\hat{\boldsymbol{\theta}}_{TLS}$, which will satisfy (6.70). In general, the existence of a unique vector cannot be guaranteed from
the $F$ and $\boldsymbol{g}$ given in (6.75). Uniqueness is imposed by assumption (3), which guarantees that the rank
of $F$ is equal to $l$.

Indeed, assume that the rank of $F$, $k$, is less than $l$, i.e., $k < l$. Let the best (in the Frobenius/spectral
norm sense) rank $k$ approximation of $X$ be $X_k$ and $X - X_k := E_k$. We know from Remarks 6.1 that

$$\|E_k\|_F = \sqrt{\sum_{i=k+1}^{l} \sigma_i^2} \ge \sigma_l.$$

Also, because $E_k$ is the perturbation (error) associated with the best approximation, we have

$$\|E\|_F \ge \|E_k\|_F$$

or

$$\|E\|_F \ge \sigma_l.$$

However, from (6.77) we have

$$\bar{\sigma}_{l+1} = \|E : \boldsymbol{e}\|_F \ge \|E\|_F \ge \sigma_l,$$

which violates assumption (3). Thus, rank$\{F\} = l$. Hence, there is a unique $\hat{\boldsymbol{\theta}}_{TLS}$ such that

$$[F : \boldsymbol{g}] \begin{bmatrix} \hat{\boldsymbol{\theta}}_{TLS} \\ -1 \end{bmatrix} = \boldsymbol{0}. \tag{6.78}$$

In other words, $[\hat{\boldsymbol{\theta}}_{TLS}^T, -1]^T$ belongs to the null space of

$$[F : \boldsymbol{g}], \tag{6.79}$$

which is a rank-deficient matrix; hence, its null space is of dimension one, and it is easily checked that
it is spanned by $\bar{\boldsymbol{v}}_{l+1}$, leading to

$$\begin{bmatrix} \hat{\boldsymbol{\theta}}_{TLS} \\ -1 \end{bmatrix} = \frac{-1}{\bar{v}_{l+1}(l+1)} \bar{\boldsymbol{v}}_{l+1},$$

where $\bar{v}_{l+1}(l+1)$ is the last component of $\bar{\boldsymbol{v}}_{l+1}$. Moreover, it can be shown that (Problem 6.18)

$$\hat{\boldsymbol{\theta}}_{TLS} = (X^T X - \bar{\sigma}_{l+1}^2 I)^{-1} X^T \boldsymbol{y} : \text{ total least-squares estimate}. \tag{6.80}$$

Note that Assumption (3) guarantees that $X^T X - \bar{\sigma}_{l+1}^2 I$ is positive definite (think of why this is so).
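The two routes to the TLS estimate, scaling the last right singular vector and evaluating the closed form (6.80), can be compared numerically. The function name `tls` and the synthetic data are assumptions for illustration; the sketch presumes that the stated assumptions hold for the data at hand.

```python
import numpy as np

def tls(X, y):
    """TLS sketch: returns the SVD-based estimate and the closed-form estimate of (6.80)."""
    l = X.shape[1]
    _, s, Vt = np.linalg.svd(np.column_stack([X, y]), full_matrices=False)
    v = Vt[-1]                        # right singular vector of the smallest singular value
    theta_svd = -v[:l] / v[l]         # scale so that the last component equals -1
    theta_closed = np.linalg.solve(X.T @ X - s[-1] ** 2 * np.eye(l), X.T @ y)
    return theta_svd, theta_closed, s[-1]

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
theta_svd, theta_closed, sigma_min = tls(X, y)
min_eig = np.linalg.eigvalsh(X.T @ X - sigma_min ** 2 * np.eye(3)).min()
```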
GEOMETRIC INTERPRETATION OF THE TOTAL LEAST-SQUARES METHOD

From (6.67) and the definition of $F$ in terms of its rows, $\boldsymbol{f}_1^T, \ldots, \boldsymbol{f}_N^T$, we get

$$\boldsymbol{f}_n^T \hat{\boldsymbol{\theta}}_{TLS} - g_n = 0, \quad n = 1, 2, \ldots, N, \tag{6.81}$$

or

$$\hat{\boldsymbol{\theta}}_{TLS}^T (\boldsymbol{x}_n - \boldsymbol{e}_n) - (y_n - e_n) = 0, \tag{6.82}$$

where $\boldsymbol{e}_n^T$ denotes the $n$th row of $E$. In words, both the regressors $\boldsymbol{x}_n$ and the outputs $y_n$ are corrected
in order for the points $(y_n - e_n, \boldsymbol{x}_n - \boldsymbol{e}_n)$, $n = 1, 2, \ldots, N$, to lie on a hyperplane in $\mathbb{R}^{l+1}$. Also, once
such a hyperplane is computed and it is unique, it has an interesting interpretation. It is the hyperplane
that minimizes the total square distance of all the training points $(y_n, \boldsymbol{x}_n)$ from it. Moreover, the
corrected points $(y_n - e_n, \boldsymbol{x}_n - \boldsymbol{e}_n) = (g_n, \boldsymbol{f}_n)$, $n = 1, 2, \ldots, N$, are the orthogonal projections of the
respective $(y_n, \boldsymbol{x}_n)$ training points onto this hyperplane. This is shown in Fig. 6.12.

FIGURE 6.12

The total least-squares method corrects both the values of the output variable as well as the input vector so that the
points, after the correction, lie on a hyperplane. The corrected points are the orthogonal projections of $(y_n, \boldsymbol{x}_n)$ on
the respective hyperplane; for centered data, this crosses the origin. For noncentered data, it crosses the centroid
$(\bar{y}, \bar{\boldsymbol{x}})$.

To prove the previous two claims, it suffices to show (Problem 6.19) that the direction of the hyperplane
that minimizes the total distance from a set of points $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, is that defined
by $\bar{\boldsymbol{v}}_{l+1}$; the latter is the right singular vector associated with the smallest singular value of $[X : \boldsymbol{y}]$,
assuming $\bar{\sigma}_l > \bar{\sigma}_{l+1}$.

To see that $(g_n, \boldsymbol{f}_n)$ is the orthogonal projection of $(y_n, \boldsymbol{x}_n)$ on this hyperplane, recall that our task
minimizes the following Frobenius norm:

$$\|X - F : \boldsymbol{y} - \boldsymbol{g}\|_F^2 = \sum_{n=1}^{N} \left( (y_n - g_n)^2 + \|\boldsymbol{x}_n - \boldsymbol{f}_n\|^2 \right).$$

However, each one of the terms in the above summation is the squared Euclidean distance between the
points $(y_n, \boldsymbol{x}_n)$ and $(g_n, \boldsymbol{f}_n)$, which is minimized if the latter is the orthogonal projection of the former
on the hyperplane.
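The projection property can also be confirmed numerically. In the sketch below (synthetic data, assumed variable names), each row of the array `A` stores a point with the regressor first and the output last, matching the column order of $[X : \boldsymbol{y}]$; the rows of the rank $l$ approximation are compared with the orthogonal projections of the original points onto the hyperplane whose unit normal is the last right singular vector.

```python
import numpy as np

rng = np.random.default_rng(1)
N, l = 25, 2
X = rng.standard_normal((N, l))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.standard_normal(N)

A = np.column_stack([X, y])                        # rows are the points (x_n, y_n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
normal = Vt[-1]                                    # unit normal of the TLS hyperplane
projected = A - np.outer(A @ normal, normal)       # orthogonal projections onto the hyperplane
Fg = (U[:, :l] * s[:l]) @ Vt[:l]                   # corrected points (f_n, g_n)
total_sq_distance = np.sum((A @ normal) ** 2)      # total squared distance from the hyperplane
```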

Remarks 6.5.

• The hyperplane defined by the TLS solution,

$$[\hat{\boldsymbol{\theta}}_{TLS}^T, -1]^T,$$

minimizes the total distance of all the points $(y_n, \boldsymbol{x}_n)$ from it. We know from geometry that the
squared distance of each point from this hyperplane is given by

$$\frac{|\hat{\boldsymbol{\theta}}_{TLS}^T \boldsymbol{x}_n - y_n|^2}{\|\hat{\boldsymbol{\theta}}_{TLS}\|^2 + 1}.$$

Thus, $\hat{\boldsymbol{\theta}}_{TLS}$ minimizes the following ratio:

$$\hat{\boldsymbol{\theta}}_{TLS} = \arg\min_{\boldsymbol{\theta}} \frac{\|X \boldsymbol{\theta} - \boldsymbol{y}\|^2}{\|\boldsymbol{\theta}\|^2 + 1}.$$

This is basically a normalized (weighted) version of the LS cost. Looking at it more carefully,
TLS promotes vectors of larger norm. This could be seen as a “deregularizing” tendency of the
TLS. From a numerical point of view, this can also be verified by (6.80). The matrix to be inverted
for the TLS solution is more ill-conditioned than its LS counterpart. Robustness of the TLS can
be improved via the use of regularization. Furthermore, extensions of TLS that employ other cost
functions, in order to address the presence of outliers, have also been proposed.

• The TLS method has also been extended to deal with the more general case where $\boldsymbol{y}$ and $\boldsymbol{\theta}$ become
matrices. For further reading, the interested reader can look at [37,64] and the references therein. A
distributed algorithm for solving the TLS task in ad hoc sensor networks has been proposed in [4].
A recursive scheme for the efficient solution of the TLS task has appeared in [12].

• TLS has widely been used in a number of applications, such as computer vision [47], system
identification [59], speech and image processing [25,51], and spectral analysis [63].
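The first remark can be illustrated with a short sketch. The helper `tls_cost` and the data are assumptions for the example. The normalized cost evaluated at the TLS estimate equals the squared smallest singular value of $[X : \boldsymbol{y}]$, it does not exceed the value at the LS estimate, and the TLS estimate has the larger norm.

```python
import numpy as np

def tls_cost(theta, X, y):
    """Normalized LS cost ||X theta - y||^2 / (||theta||^2 + 1)."""
    return np.sum((X @ theta - y) ** 2) / (theta @ theta + 1.0)

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.standard_normal(60)

theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
_, s, Vt = np.linalg.svd(np.column_stack([X, y]), full_matrices=False)
theta_tls = -Vt[-1][:3] / Vt[-1][3]                # TLS estimate from the last right singular vector
cost_tls = tls_cost(theta_tls, X, y)
cost_ls = tls_cost(theta_ls, X, y)
```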

Example 6.3. To demonstrate the potential of the TLS to improve upon the performance of the LS
estimator, in this example, we add noise to both the input and the output samples. To this
end, we generate randomly an input matrix, $X \in \mathbb{R}^{150 \times 90}$, filling it with elements according to the
normalized Gaussian, $\mathcal{N}(0, 1)$. In the sequel, we generate the vector $\boldsymbol{\theta}_o \in \mathbb{R}^{90}$ by randomly drawing
samples also from the normalized Gaussian. The output vector is formed as

$$\boldsymbol{y} = X \boldsymbol{\theta}_o.$$

Then we generate a noise vector $\boldsymbol{\eta} \in \mathbb{R}^{150}$, filling it with elements randomly drawn from $\mathcal{N}(0, 0.01)$,
and form

$$\tilde{\boldsymbol{y}} = \boldsymbol{y} + \boldsymbol{\eta}.$$

A noisy version of the input matrix is obtained as

$$\tilde{X} = X + E,$$

where $E$ is filled randomly with elements by drawing samples from $\mathcal{N}(0, 0.2)$.

Using the generated $\tilde{\boldsymbol{y}}$, $X$, $\tilde{X}$ and pretending that we do not know $\boldsymbol{\theta}_o$, the following three estimates
are obtained for its value.

• Using the LS estimator (6.5) together with $X$ and $\tilde{\boldsymbol{y}}$, the average (over 10 different realizations)
Euclidean distance of the obtained estimate $\hat{\boldsymbol{\theta}}$ from the true one is equal to $\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o\| = 0.0125$.

• Using the LS estimator (6.5) together with $\tilde{X}$ and $\tilde{\boldsymbol{y}}$, the average (over 10 different realizations)
Euclidean distance of the obtained estimate $\hat{\boldsymbol{\theta}}$ from the true one is equal to $\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o\| = 0.4272$.

• Using the TLS estimator (6.80) together with $\tilde{X}$ and $\tilde{\boldsymbol{y}}$, the average (over 10 different realizations)
Euclidean distance of the obtained estimate $\hat{\boldsymbol{\theta}}$ from the true one is equal to $\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o\| = 0.2652$.

Observe that using noisy input data, the LS estimator resulted in higher error compared to the TLS one.
Note, however, that the successful application of the TLS presupposes that the assumptions that led to
the TLS estimator are valid.
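The setup of this example can be sketched as follows for a single realization. The sketch assumes that the second argument of $\mathcal{N}(\cdot, \cdot)$ is a variance, and the helper names `ls` and `tls` are hypothetical; the resulting numbers depend on the realization and on this reading of the setup, so they should not be expected to match the averages quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 150, 90
X = rng.standard_normal((N, l))
theta_o = rng.standard_normal(l)
y_noisy = X @ theta_o + np.sqrt(0.01) * rng.standard_normal(N)    # output noise, variance 0.01
X_noisy = X + np.sqrt(0.2) * rng.standard_normal((N, l))          # input noise, variance 0.2

def ls(A, b):
    """LS estimate."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

def tls(A, b):
    """TLS estimate from the right singular vector of the smallest singular value of [A : b]."""
    Vt = np.linalg.svd(np.column_stack([A, b]), full_matrices=False)[2]
    return -Vt[-1][:-1] / Vt[-1][-1]

err_ls_clean = np.linalg.norm(ls(X, y_noisy) - theta_o)           # true inputs, noisy outputs
err_ls_noisy = np.linalg.norm(ls(X_noisy, y_noisy) - theta_o)     # noisy inputs and outputs
err_tls_noisy = np.linalg.norm(tls(X_noisy, y_noisy) - theta_o)   # TLS on the noisy data
```

Averaging the three error values over several seeds gives a comparison in the spirit of the example.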


PROBLEMS 

6.1 Show that if $A \in \mathbb{C}^{m \times m}$ is positive semidefinite, its trace is nonnegative.

6.2 Show that under (a) the independence assumption of successive observation vectors and (b)
the presence of white noise independent of the input, the LS estimator is asymptotically
distributed according to the normal distribution, i.e.,

$$\sqrt{N} (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_o) \longrightarrow \mathcal{N}(\boldsymbol{0}, \sigma_\eta^2 \Sigma_x^{-1}),$$

where $\sigma_\eta^2$ is the noise variance and $\Sigma_x$ the covariance matrix of the input observation vectors,
assuming that it is invertible.

6.3 Let $X \in \mathbb{C}^{m \times l}$. Then show that the two matrices

$$X X^H \quad \text{and} \quad X^H X$$

have the same nonzero eigenvalues.

6.4 Show that if $X \in \mathbb{C}^{m \times l}$, then the eigenvalues of $X X^H$ ($X^H X$) are real and nonnegative.
Moreover, show that if $\lambda_i \ne \lambda_j$, then $\boldsymbol{v}_i \perp \boldsymbol{v}_j$.

6.5 Let $X \in \mathbb{C}^{m \times l}$. Show that if $\boldsymbol{v}_i$ is the normalized eigenvector of $X^H X$, corresponding to $\lambda_i \ne 0$,
then the corresponding normalized eigenvector $\boldsymbol{u}_i$ of $X X^H$ is given by

$$\boldsymbol{u}_i = \frac{1}{\sqrt{\lambda_i}} X \boldsymbol{v}_i.$$

6.6 Show Eq. (6.19).

6.7 Show that the right singular vectors, $\boldsymbol{v}_1, \ldots, \boldsymbol{v}_r$, corresponding to the $r$ singular values of a rank
$r$ matrix $X$ solve the following iterative optimization task: compute $\boldsymbol{v}_k$, $k = 2, 3, \ldots, r$, such that

$$\text{minimize} \quad -\frac{1}{2} \|X \boldsymbol{v}\|^2, \tag{6.83}$$

$$\text{subject to} \quad \|\boldsymbol{v}\|^2 = 1, \tag{6.84}$$

$$\boldsymbol{v} \perp \{\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{k-1}\}, \quad k \ne 1, \tag{6.85}$$

where $\|\cdot\|$ denotes the Euclidean norm.

6.8 Show that projecting the rows of $X$ onto the $k$ rank subspace, $V_k = \mathrm{span}\{\boldsymbol{v}_1, \ldots, \boldsymbol{v}_k\}$, results in
the largest variance, compared to any other $k$-dimensional subspace, $Z_k$.

6.9 Show that the squared Frobenius norm is equal to the sum of the squared singular values.

6.10 Show that the best $k$ rank approximation of a matrix $X$ of rank $r > k$, in the Frobenius norm
sense, is given by

$$\hat{X} = \sum_{i=1}^{k} \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^T,$$

where $\sigma_i$ are the singular values and $\boldsymbol{v}_i$, $\boldsymbol{u}_i$, $i = 1, 2, \ldots, r$, are the right and left singular vectors
of $X$, respectively. Then show that the approximation error is given by

$$\sqrt{\sum_{i=k+1}^{r} \sigma_i^2}.$$

6.11 Show that $\hat{X}$, as given in Problem 6.10, also minimizes the spectral norm and that

$$\|X - \hat{X}\|_2 = \sigma_{k+1}.$$


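6.12 Show that the Frobenius and spectral norms are unaffected by multiplication with orthogonal
matrices, i.e.,

$$\|X\|_F = \|Q X U\|_F$$

and

$$\|X\|_2 = \|Q X U\|_2,$$

if $Q Q^T = U U^T = I$.

6.13 Show that the null and range spaces of an $m \times l$ matrix $X$ of rank $r$ are given by

$$\mathcal{N}(X) = \mathrm{span}\{\boldsymbol{v}_{r+1}, \ldots, \boldsymbol{v}_l\},$$

$$\mathcal{R}(X) = \mathrm{span}\{\boldsymbol{u}_1, \ldots, \boldsymbol{u}_r\},$$

where

$$X = [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_m] \begin{bmatrix} D & O \\ O & O \end{bmatrix} \begin{bmatrix} \boldsymbol{v}_1^T \\ \vdots \\ \boldsymbol{v}_l^T \end{bmatrix}.$$

6.14 Show that for the ridge regression

$$\hat{\boldsymbol{y}} = \sum_{i=1}^{l} \frac{\sigma_i^2}{\lambda + \sigma_i^2} (\boldsymbol{u}_i^T \boldsymbol{y}) \boldsymbol{u}_i.$$

6.15 Show that the normalized steepest descent direction of $J(\boldsymbol{\theta})$ at a point $\boldsymbol{\theta}_0$ for the quadratic norm
$\|\boldsymbol{v}\|_P$ is given by

$$\boldsymbol{v} = -\frac{P^{-1} \nabla J(\boldsymbol{\theta}_0)}{\|P^{-1} \nabla J(\boldsymbol{\theta}_0)\|_P}.$$

6.16 Explain why the convergence of Newton’s iterative minimization method is relatively insensitive
to the Hessian matrix.

Hint: Let $P$ be a positive definite matrix. Define a change of variables,

$$\tilde{\boldsymbol{\theta}} = P^{1/2} \boldsymbol{\theta},$$

and carry out gradient descent minimization based on the new variable.

6.17 Show that the steepest descent direction $\boldsymbol{v}$ of $J(\boldsymbol{\theta})$ at a point $\boldsymbol{\theta}_0$, constrained to

$$\|\boldsymbol{v}\|_1 = 1,$$

is given by $\boldsymbol{e}_k$, where $\boldsymbol{e}_k$ is the standard basis vector in the direction $k$, such that

$$|(\nabla J(\boldsymbol{\theta}_0))_k| > |(\nabla J(\boldsymbol{\theta}_0))_j|, \quad k \ne j.$$

6.18 Show that the TLS solution is given by

$$\hat{\boldsymbol{\theta}} = (X^T X - \bar{\sigma}_{l+1}^2 I)^{-1} X^T \boldsymbol{y},$$

where $\bar{\sigma}_{l+1}$ is the smallest singular value of $[X : \boldsymbol{y}]$.

6.19 Given a set of centered data points, $(y_n, \boldsymbol{x}_n) \in \mathbb{R}^{l+1}$, derive a hyperplane

$$\boldsymbol{a}^T \boldsymbol{x} + y = 0,$$

which crosses the origin, such that the total square distance of all the points from it is minimum.

MATLAB® EXERCISES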

6.20 Consider the regression model

$$y_n = \boldsymbol{\theta}_o^T \boldsymbol{x}_n + \eta_n,$$

where $\boldsymbol{\theta}_o \in \mathbb{R}^{200}$ ($l = 200$) and the coefficients of the unknown vector are obtained randomly via
the Gaussian distribution $\mathcal{N}(0, 1)$. The noise samples are also i.i.d., according to a Gaussian of
zero mean and variance $\sigma^2 = 0.01$. The input sequence is a white noise one, i.i.d. generated via
the Gaussian $\mathcal{N}(0, 1)$.

Using as training data the samples $(y_n, \boldsymbol{x}_n) \in \mathbb{R} \times \mathbb{R}^{200}$, $n = 1, 2, \ldots$, run the APA
(Algorithm 5.2), the NLMS algorithm (Algorithm 5.3), and the RLS algorithm (Algorithm 6.1) to
estimate the unknown $\boldsymbol{\theta}_o$.

For the APA, choose $\mu = 0.2$, $\delta = 0.001$, and $q = 30$. Furthermore, in the NLMS set $\mu = 1.2$
and $\delta = 0.001$. Finally, for the RLS set the forgetting factor $\beta$ equal to 1. Run 100 independent
experiments and plot the average error per iteration in dBs, i.e., $10\log_{10}(e_n^2)$, where
$e_n^2 = (y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}_{n-1})^2$. Compare the performance of the algorithms.

Keep playing with different parameters and study their effect on the convergence speed and
the error floor to which the algorithms converge.

6.21 Consider the linear system

$$y_n = \boldsymbol{x}_n^T \boldsymbol{\theta}_{o,n-1} + \eta_n, \tag{6.86}$$

where $l = 5$ and the unknown vector is time-varying. Generate the unknown vector with respect
to the following model:

$$\boldsymbol{\theta}_{o,n} = \alpha \boldsymbol{\theta}_{o,n-1} + \boldsymbol{\omega}_n,$$

where $\alpha = 0.97$ and the coefficients of $\boldsymbol{\omega}_n$ are i.i.d. drawn from the Gaussian distribution, with
zero mean and variance equal to 0.1. Generate the initial value $\boldsymbol{\theta}_{o,0}$ with respect to $\mathcal{N}(0, 1)$.

The noise samples are i.i.d., having zero mean and variance equal to 0.001. Furthermore,
generate the input samples so that they follow the Gaussian distribution $\mathcal{N}(0, 1)$. Compare the
performance of the NLMS and RLS algorithms. For the NLMS, set $\mu = 0.5$ and $\delta = 0.001$. For
the RLS, set the forgetting factor $\beta$ equal to 0.995. Run 200 independent experiments and plot
the average error per iteration in dBs, i.e., $10\log_{10}(e_n^2)$, with $e_n^2 = (y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}_{n-1})^2$. Compare the
performance of the algorithms.

Keep the same parameters, but set the variance associated with $\boldsymbol{\omega}_n$ equal to 0.01, 0.001. Play
with different values of the parameters and the variance of the noise $\boldsymbol{\omega}$.

6.22 Generate a $150 \times 90$ matrix $X$, the entries of which follow the Gaussian distribution $\mathcal{N}(0, 1)$.
Generate the vector $\boldsymbol{\theta}_o \in \mathbb{R}^{90}$. The coefficients of this vector are i.i.d. obtained, also, via the
Gaussian $\mathcal{N}(0, 1)$. Compute the vector $\boldsymbol{y} = X\boldsymbol{\theta}_o$. Add a $150 \times 1$ noise vector, $\boldsymbol{\eta}$, to $\boldsymbol{y}$ in order to
generate $\tilde{\boldsymbol{y}} = \boldsymbol{y} + \boldsymbol{\eta}$. The elements of $\boldsymbol{\eta}$ are generated via the Gaussian $\mathcal{N}(0, 0.01)$. In the sequel,
add a $150 \times 90$ noise matrix, $E$, so as to produce $\tilde{X} = X + E$; the elements of $E$ are generated
according to the Gaussian $\mathcal{N}(0, 0.2)$. Compute the LS estimate via (6.5) by employing (a) the
true input matrix $X$ and the noisy output $\tilde{\boldsymbol{y}}$; and (b) the noisy input matrix $\tilde{X}$ and the noisy
output $\tilde{\boldsymbol{y}}$.

In the sequel, compute the TLS estimate via (6.80) using the noisy input matrix $\tilde{X}$ and the
noisy output $\tilde{\boldsymbol{y}}$.

Repeat the experiments a number of times and compute the average Euclidean distances between
the obtained estimates for the previous three cases and the true parameter vector $\boldsymbol{\theta}_o$.

Play with different noise levels and comment on the results.

7.1 INTRODUCTION

The classification task was introduced in Chapter 3. There, it was pointed out that, in principle, one
could employ the same loss functions as those used for regression in order to optimize the design of
a classifier; however, for most cases in practice, this is not the most reasonable way to attack such
problems. This is because in classification the output random variable, $y$, is of a discrete nature; hence,
measures different from those used for the regression task are more appropriate for quantifying
performance quality.

The goal of this chapter is to present a number of widely used loss functions and methods. Most of
the techniques covered are conceptually simple and constitute the basic pillars on which classification
is built. Besides their pedagogical importance, these techniques are still in use in a number of practical
applications and often form the basis for the development of more advanced methods, to be covered
later in the book.

The classical Bayesian classification rule, the notion of minimum distance classifiers, the logistic 
regression loss function, Fisher’s linear discriminant, classification trees, and the method of combining 
classifiers, including the powerful technique of boosting, are discussed. The perceptron rule, although
it is among the most basic classification rules, will be treated in Chapter 18, where it will be
used as the starting point for introducing neural networks and deep learning techniques. Support vector
machines are treated in the framework of reproducing kernel Hilbert spaces in Chapter 11.

In a nutshell, this chapter can be considered a beginner’s tour of the task of designing classifiers. 


7.2 BAYESIAN CLASSIFICATION 

In Chapter 3, a linear classifier was designed via the least-squares (LS) method. However, the squared 
error criterion cannot serve well the needs of the classification task. In Chapters 3 and 6, we have proved 
that the LS estimator is an efficient one only if the conditional distribution of the output variable y, 
given the feature values x, follows a Gaussian distribution of a special type. However, in classification, 
the dependent variable is discrete, hence it is not Gaussian; thus, the use of the squared error criterion 
cannot be justified, in general. We will return to this issue in Section 7.10 (Remarks 7.7), when the 
squared error is discussed against other loss functions used in classification. 

In this section, the classification task will be approached via a different path, inspired by the 
Bayesian decision theory. In spite of its conceptual simplicity, which ties very well with common 
sense, Bayesian classification possesses a strong optimality flavor with respect to the probability of 
error, that is, the probability of wrong decisions/class predictions that a classifier commits. 

Bayesian classification rule: Given a set of $M$ classes, $\omega_i$, $i = 1, 2, \ldots, M$, and the respective posterior
probabilities $P(\omega_i|\boldsymbol{x})$, classify an unknown feature vector, $\boldsymbol{x}$, according to the following rule:

$$\text{Assign } \boldsymbol{x} \text{ to } \omega_i = \arg\max_{\omega_j} P(\omega_j|\boldsymbol{x}), \quad j = 1, 2, \ldots, M. \tag{7.1}$$

In words, the unknown pattern, represented by $\boldsymbol{x}$, is assigned to the class for which the posterior
probability becomes maximum.

Note that prior to receiving any observation, our uncertainty concerning the classes is expressed
via the prior class probabilities, denoted by $P(\omega_i)$, $i = 1, 2, \ldots, M$. Once the observation $\boldsymbol{x}$ has been
obtained, this extra information removes part of our original uncertainty, and the related statistical 
information is now provided by the posterior probabilities, which are then used for the classification. 


1 Recall that probability values for discrete random variables are denoted by capital P, and PDFs, for continuous random 
variables, by lower case p. 

Employing in (7.1) the Bayes theorem,

$$P(\omega_j|\boldsymbol{x}) = \frac{p(\boldsymbol{x}|\omega_j)P(\omega_j)}{p(\boldsymbol{x})}, \quad j = 1, 2, \ldots, M, \tag{7.2}$$

where $p(\boldsymbol{x}|\omega_j)$ are the respective conditional PDFs, the Bayesian classification rule becomes

$$\text{Assign } \boldsymbol{x} \text{ to } \omega_i = \arg\max_{\omega_j} p(\boldsymbol{x}|\omega_j)P(\omega_j), \quad j = 1, 2, \ldots, M. \tag{7.3}$$

Note that the probability density function (PDF) of the data, $p(\boldsymbol{x})$, in the denominator of (7.2) does
not enter in the maximization task, because it is a positive quantity independent of the classes $\omega_j$;
hence, it does not affect the maximization. In other words, the classifier depends on the a priori class
probabilities and the respective conditional PDFs. Also, recall that

$$p(\boldsymbol{x}|\omega_j)P(\omega_j) = p(\omega_j, \boldsymbol{x}) := p(y, \boldsymbol{x}),$$

where in the current context, the output variable, $y$, denotes the label associated with the corresponding
class $\omega_j$. The last equation verifies what was said in Chapter 3: the Bayesian classifier is a generative
modeling technique.
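
Rule (7.3) can be illustrated with a minimal Python sketch for one-dimensional features. The Gaussian form of the class-conditional PDFs, the function names, and the numerical values below are assumptions made for illustration only; they are not prescribed by the text.

```python
import math

def gaussian_pdf(x, mean, var):
    """Value of the one-dimensional Gaussian PDF N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, priors, means, variances):
    """Posterior probabilities P(w_j | x) obtained via the Bayes theorem (7.2)."""
    joint = [P * gaussian_pdf(x, m, v) for P, m, v in zip(priors, means, variances)]
    evidence = sum(joint)  # p(x): the same for every class
    return [j / evidence for j in joint]

def bayes_classify(x, priors, means, variances):
    """Rule (7.3): index of the class maximizing p(x | w_j) P(w_j)."""
    joint = [P * gaussian_pdf(x, m, v) for P, m, v in zip(priors, means, variances)]
    return max(range(len(joint)), key=joint.__getitem__)

# Illustrative (assumed) two-class setup.
priors = [0.5, 0.5]
means = [0.0, 1.0]
variances = [1.0, 1.0]

post = posteriors(0.2, priors, means, variances)
label = bayes_classify(0.2, priors, means, variances)
```

Because $p(x)$ is only a normalizing constant, `bayes_classify` never computes it; the class it returns is the one with the largest entry in `post`.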

We now turn our attention to how one can obtain estimates of the involved quantities. Recall that
in practice, all one has at one’s disposal is a set of training data, from which estimates of the prior
probabilities as well as the conditional PDFs must be obtained. Let us assume that we are given a
set of training points, $(y_n, \boldsymbol{x}_n) \in D \times \mathbb{R}^l$, $n = 1, 2, \ldots, N$, where $D$ is the discrete set of class labels,
and consider the general task comprising $M$ classes. Assume that each class $\omega_i$, $i = 1, 2, \ldots, M$, is
represented by $N_i$ points in the training set, with $\sum_{i=1}^{M} N_i = N$. Then the a priori probabilities can be
approximated by

$$P(\omega_i) \approx \frac{N_i}{N}, \quad i = 1, 2, \ldots, M. \tag{7.4}$$

For the conditional PDFs, $p(\boldsymbol{x}|\omega_i)$, $i = 1, 2, \ldots, M$, any method for estimating PDFs can be mobilized.
For example, one can assume a known parametric form for each one of the conditionals and adopt the
maximum likelihood (ML) method, discussed in Section 3.10, or the maximum a posteriori (MAP)
estimator, discussed in Section 3.11.1, in order to obtain estimates of the parameters using the training
data from each one of the classes. Another alternative is to resort to nonparametric histogram-like
techniques, such as Parzen windows and the k-nearest neighbor density estimation techniques, as discussed
in Section 3.15. Other methods for PDF estimation can also be employed, such as mixture modeling,
to be discussed in Chapter 12. The interested reader may also consult [38,39].
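
As a rough sketch of this estimation step, the Python fragment below approximates the priors by the class frequencies, as in (7.4), and fits a one-dimensional Gaussian per class via ML. The toy training set, the label names, and the choice of a Gaussian parametric form are all assumptions made for illustration.

```python
from collections import Counter

def estimate_priors(labels):
    """Approximate P(w_i) by N_i / N, as in (7.4)."""
    counts = Counter(labels)
    n_total = len(labels)
    return {cls: n_i / n_total for cls, n_i in counts.items()}

def ml_gaussian_1d(samples):
    """ML estimates of the mean and variance of a 1-D Gaussian (variance divides by N)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

# Hypothetical toy training set: (label, feature) pairs.
training = [("w1", 0.1), ("w1", -0.4), ("w1", 0.3), ("w2", 1.2), ("w2", 0.9)]
labels = [y for y, _ in training]

priors = estimate_priors(labels)
# One (mean, variance) pair per class, each fitted on that class's points only.
params = {
    cls: ml_gaussian_1d([x for y, x in training if y == cls]) for cls in priors
}
```

Each class is fitted using only its own $N_i$ points, which is why classes with few training points yield less reliable estimates.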


THE BAYESIAN CLASSIFIER MINIMIZES THE MISCLASSIFICATION ERROR 

In Section 3.4, it was pointed out that the goal of designing a classifier is to partition the space in which
the feature vectors lie into regions, and associate each one of the regions to one and only one class. For
a two-class task (the generalization to more classes is straightforward), let $\mathcal{R}_1$, $\mathcal{R}_2$ be the two regions
in $\mathbb{R}^l$, where we decide in favor of class $\omega_1$ and $\omega_2$, respectively. The probability of classification error
is given by

$$P_e = P(\boldsymbol{x} \in \mathcal{R}_1, \boldsymbol{x} \in \omega_2) + P(\boldsymbol{x} \in \mathcal{R}_2, \boldsymbol{x} \in \omega_1). \tag{7.5}$$

FIGURE 7.1

(A) The classification error probability for partitioning the feature space, according to the Bayesian optimal
classifier, is equal to the area of the shaded region. (B) Moving the threshold value away from the value corresponding
to the optimal Bayes rule increases the probability of error, as is indicated by the increase of the area of the
corresponding shaded region.


That is, it is equal to the probability of the feature vector to belong to class $\omega_1$ ($\omega_2$) and to lie in the
“wrong” region $\mathcal{R}_2$ ($\mathcal{R}_1$) in the feature space.

Eq. (7.5) can be written as

$$P_e = P(\omega_2)\int_{\mathcal{R}_1} p(\boldsymbol{x}|\omega_2)\,d\boldsymbol{x} + P(\omega_1)\int_{\mathcal{R}_2} p(\boldsymbol{x}|\omega_1)\,d\boldsymbol{x}: \quad \text{probability of error}. \tag{7.6}$$

It turns out that the Bayesian classifier, as defined in (7.3), minimizes $P_e$ with respect to $\mathcal{R}_1$ and $\mathcal{R}_2$
[11,38]. This is also true for the general case of $M$ classes (Problem 7.1).

Fig. 7.1A demonstrates geometrically the optimality of the Bayesian classifier for the two-class
one-dimensional case and assuming equiprobable classes ($P(\omega_1) = P(\omega_2) = 1/2$). The region $\mathcal{R}_1$, to
the left of the threshold value $x_0$, corresponds to $p(x|\omega_1) > p(x|\omega_2)$, and the opposite is true for region
$\mathcal{R}_2$. The probability of error is equal to the area of the shaded region, which is equal to the sum of the
two integrals in (7.6). In Fig. 7.1B, the threshold has been moved away from the optimal Bayesian
value, and as a result the probability of error, given by the total area of the corresponding shaded
region, increases.
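
The effect of moving the threshold can be checked numerically. The following Python sketch evaluates (7.6) for two equiprobable one-dimensional Gaussian classes; the means (0 and 1), the unit variance, and the trial thresholds are assumed values chosen only for illustration.

```python
import math

def gaussian_cdf(x, mean=0.0, std=1.0):
    """CDF of N(mean, std^2), expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def error_probability(threshold, p1=0.5, p2=0.5, mean1=0.0, mean2=1.0, std=1.0):
    """Eq. (7.6) with R1 = (-inf, threshold) and R2 = (threshold, +inf)."""
    err_class2 = p2 * gaussian_cdf(threshold, mean2, std)          # class-2 mass falling in R1
    err_class1 = p1 * (1.0 - gaussian_cdf(threshold, mean1, std))  # class-1 mass falling in R2
    return err_class2 + err_class1

bayes_threshold = 0.5  # midpoint of the means: equal priors, equal variances
pe_opt = error_probability(bayes_threshold)
pe_left = error_probability(0.0)   # threshold moved to the left
pe_right = error_probability(1.2)  # threshold moved to the right
```

Under these assumptions, both shifted thresholds give a larger error probability than the Bayesian one, in line with Fig. 7.1B.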

7.2.1 AVERAGE RISK 

Because in classification the dependent variable (label), y, is of a discrete nature, the classification 
error probability may seem like the most natural cost function to be optimized. However, this is not 
always true. In certain applications, not all errors are of the same importance. For example, in a medical 
diagnosis system, committing an error by predicting the class of a finding in an X-ray image as being 
“malignant” while its true class is “normal” is less significant than an error the other way around. In the
former case, the wrong diagnosis will be revealed in the next set of medical tests. However, the opposite
may have unwanted consequences. For such cases, one uses an alternative to the probability of error
cost function that puts relative weights on the errors according to their importance. This cost function
is known as the average risk and it results in a rule that resembles that of the Bayesian classifier, yet it
is slightly modified due to the presence of the weights.

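Let us start from the simpler two-class case. The risk or loss associated with each one of the two
classes is defined as

$$r_1 = \lambda_{11}\int_{\mathcal{R}_1} p(\boldsymbol{x}|\omega_1)\,d\boldsymbol{x} + \underbrace{\lambda_{12}\int_{\mathcal{R}_2} p(\boldsymbol{x}|\omega_1)\,d\boldsymbol{x}}_{\text{error}}, \tag{7.7}$$

$$r_2 = \underbrace{\lambda_{21}\int_{\mathcal{R}_1} p(\boldsymbol{x}|\omega_2)\,d\boldsymbol{x}}_{\text{error}} + \lambda_{22}\int_{\mathcal{R}_2} p(\boldsymbol{x}|\omega_2)\,d\boldsymbol{x}. \tag{7.8}$$

Typically, $\lambda_{11} = \lambda_{22} = 0$, since they correspond to correct decisions. The average risk, to be minimized,
is given by

$$r = P(\omega_1)r_1 + P(\omega_2)r_2.$$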

Then, following similar arguments as before for the optimal Bayes classifier, the optimal average risk
classifier rule becomes

$$\text{Assign } \boldsymbol{x} \text{ to } \omega_1\ (\omega_2) \text{ if: } \lambda_{12}P(\omega_1|\boldsymbol{x}) > (<)\ \lambda_{21}P(\omega_2|\boldsymbol{x}). \tag{7.9}$$

Equivalently, we can write

$$\text{Assign } \boldsymbol{x} \text{ to } \omega_1\ (\omega_2) \text{ if: } \lambda_{12}P(\omega_1)p(\boldsymbol{x}|\omega_1) > (<)\ \lambda_{21}P(\omega_2)p(\boldsymbol{x}|\omega_2). \tag{7.10}$$

Note that if $\lambda_{12}$ is large, compared to $\lambda_{21}$, this means that class $\omega_1$ is more “important.” Looking at it
from a slightly different view, one can interpret the use of the weights as a way to increase the prior
probability for class $\omega_1$ with respect to that of the class $\omega_2$, i.e.,

$$\frac{P'(\omega_1)}{P'(\omega_2)} > \frac{P(\omega_1)}{P(\omega_2)}.$$

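For the $M$-class problem, the risk (loss) associated with class $\omega_k$ is defined as

$$r_k = \sum_{i=1}^{M} \lambda_{ki}\int_{\mathcal{R}_i} p(\boldsymbol{x}|\omega_k)\,d\boldsymbol{x}, \tag{7.11}$$

where $\lambda_{kk} = 0$ and $\lambda_{ki}$ is the weight that controls the significance of committing an error by assigning
a pattern from class $\omega_k$ to class $\omega_i$. The average risk is given by

$$r = \sum_{k=1}^{M} P(\omega_k)r_k = \sum_{i=1}^{M}\int_{\mathcal{R}_i}\left(\sum_{k=1}^{M}\lambda_{ki}P(\omega_k)p(\boldsymbol{x}|\omega_k)\right)d\boldsymbol{x}, \tag{7.12}$$

which is minimized if we partition the input space by selecting each $\mathcal{R}_i$ (where we decide in favor of
class $\omega_i$) so that each one of the $M$ integrals in the summation becomes minimum; this is achieved if
we adopt the rule

$$\text{Assign } \boldsymbol{x} \text{ to } \omega_i: \quad \sum_{k=1}^{M}\lambda_{ki}P(\omega_k)p(\boldsymbol{x}|\omega_k) < \sum_{k=1}^{M}\lambda_{kj}P(\omega_k)p(\boldsymbol{x}|\omega_k), \quad \forall j \neq i,$$

or equivalently,

$$\text{Assign } \boldsymbol{x} \text{ to } \omega_i: \quad \sum_{k=1}^{M}\lambda_{ki}P(\omega_k|\boldsymbol{x}) < \sum_{k=1}^{M}\lambda_{kj}P(\omega_k|\boldsymbol{x}), \quad \forall j \neq i. \tag{7.13}$$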


It is common to consider the weights as defining an $M \times M$ matrix

$$L := [\lambda_{ij}], \quad i, j = 1, 2, \ldots, M, \tag{7.14}$$

which is known as the loss matrix. Note that if we set $\lambda_{ki} = 1$, $k = 1, 2, \ldots, M$, $i = 1, 2, \ldots, M$, $k \neq i$,
then we obtain the Bayes rule (verify it).
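
The minimum average risk rule (7.13) amounts to a few lines of code once the posteriors are available. The Python sketch below is illustrative only: the function name and the posterior values are assumptions, and the entry `loss[k][i]` is taken to be the weight $\lambda_{ki}$ of assigning a pattern from class $k$ to class $i$.

```python
def min_risk_classify(posteriors, loss):
    """Rule (7.13): pick the class i minimizing sum_k loss[k][i] * P(w_k | x)."""
    n_classes = len(posteriors)
    risks = [
        sum(loss[k][i] * posteriors[k] for k in range(n_classes))
        for i in range(n_classes)
    ]
    return min(range(n_classes), key=risks.__getitem__), risks

# Loss matrix with lambda_12 = 1 and lambda_21 = 0.5 (errors on class 1 cost more).
loss = [[0.0, 1.0],
        [0.5, 0.0]]
# All off-diagonal weights equal to one: this reduces to the Bayes rule.
zero_one = [[0.0, 1.0],
            [1.0, 0.0]]

post = [0.4, 0.6]  # assumed posterior values for some x
risk_decision, risks = min_risk_classify(post, loss)
bayes_decision, _ = min_risk_classify(post, zero_one)
```

With these assumed posteriors, the weighted rule selects the first class (index 0) although the second has the larger posterior, whereas the all-ones loss matrix reproduces the maximum posterior decision.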

Remarks 7.1. 


• The reject option: Bayesian classification relies on the maximum value of the posterior probabilities,
$P(\omega_i|\boldsymbol{x})$, $i = 1, 2, \ldots, M$. However, often in practice, it may happen that for some value $\boldsymbol{x}$, the
maximum value is comparable to the values the posterior obtains for other classes. For example,
in a two-class task, it may turn out that $P(\omega_1|\boldsymbol{x}) = 0.51$ and $P(\omega_2|\boldsymbol{x}) = 0.49$. If this happens, it
may be more sensible not to make a decision for this particular pattern, $\boldsymbol{x}$. This is known as the
reject option. If such a decision scenario is adopted, a user-defined threshold value $\theta$ is chosen and
classification is carried out only if the maximum posterior is larger than this threshold value, so that
$P(\omega_i|\boldsymbol{x}) > \theta$. Otherwise, no decision is taken. Similar arguments can be adopted for the average
risk classification.
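
A minimal Python sketch of the reject option follows. The function name, the use of `None` to signal "no decision", and the threshold value 0.8 are assumptions made for illustration.

```python
def classify_with_reject(posteriors, threshold):
    """Return the index of the winning class, or None when the decision is rejected."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    if posteriors[best] > threshold:
        return best
    return None  # maximum posterior too small: no decision is taken

ambiguous = classify_with_reject([0.51, 0.49], threshold=0.8)  # rejected
confident = classify_with_reject([0.95, 0.05], threshold=0.8)  # class index 0
```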


Example 7.1. In a two-class, one-dimensional classification task, the data in the two classes are
distributed according to the following two Gaussians:

$$p(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$

and

$$p(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-1)^2}{2}\right).$$

The problem is more sensitive with respect to errors committed on patterns from class $\omega_1$, which is
expressed via the following loss matrix:

$$L = \begin{bmatrix} 0 & 1 \\ 0.5 & 0 \end{bmatrix}.$$

In other words, $\lambda_{12} = 1$ and $\lambda_{21} = 0.5$. The two classes are considered equiprobable. Derive the
threshold value $x_r$, which partitions the feature space $\mathbb{R}$ into the two regions $\mathcal{R}_1$, $\mathcal{R}_2$ in which we decide in
favor of class $\omega_1$ and $\omega_2$, respectively. What is the value of the threshold when the Bayesian classifier
is used instead?

Solution: According to the average risk rule, the region for which we decide in favor of class $\omega_1$ is
given by

$$\mathcal{R}_1: \quad \lambda_{12}\frac{1}{2}p(x|\omega_1) > \lambda_{21}\frac{1}{2}p(x|\omega_2),$$

and the respective threshold value $x_r$ is computed by the equation

$$\exp\left(-\frac{x_r^2}{2}\right) = 0.5\exp\left(-\frac{(x_r-1)^2}{2}\right),$$

which, after taking the logarithm and solving the respective equation, trivially results in

$$x_r = \frac{1}{2}(1 - 2\ln 0.5).$$

The threshold for the Bayesian classifier results if we set $\lambda_{21} = 1$, which gives

$$x_B = \frac{1}{2}.$$

The geometry is shown in Fig. 7.2. In other words, the use of the average risk moves the threshold to
the right of the value corresponding to the Bayesian classifier; that is, it enlarges the region in which
we decide in favor of the more significant class, $\omega_1$. Note that the same effect would be obtained with
the Bayesian classifier if the two classes were not equiprobable, with $P(\omega_1) > P(\omega_2)$ (for our
example, $P(\omega_1) = 2P(\omega_2)$).
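
The algebra behind the two thresholds can be spelled out as follows. This is a worked sketch that assumes unit-variance Gaussians centered at 0 and 1, which is the reading consistent with the stated result $x_r = \frac{1}{2}(1 - 2\ln 0.5)$.

```latex
\begin{aligned}
\lambda_{12}\,\tfrac{1}{2}\,p(x_r|\omega_1) &= \lambda_{21}\,\tfrac{1}{2}\,p(x_r|\omega_2)
  && \text{(boundary of } \mathcal{R}_1\text{)} \\
\exp\!\left(-\tfrac{x_r^2}{2}\right) &= 0.5\,\exp\!\left(-\tfrac{(x_r-1)^2}{2}\right)
  && (\lambda_{12}=1,\ \lambda_{21}=0.5) \\
-\tfrac{x_r^2}{2} &= \ln 0.5 - \tfrac{x_r^2 - 2x_r + 1}{2}
  && \text{(take logarithms)} \\
0 &= \ln 0.5 + x_r - \tfrac{1}{2} \\
x_r &= \tfrac{1}{2}\,(1 - 2\ln 0.5) \approx 1.19 .
\end{aligned}
```

Setting $\lambda_{21} = 1$ replaces $\ln 0.5$ by $\ln 1 = 0$ in the last two lines, which gives the Bayesian threshold $x_B = 1/2$.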


7.3 DECISION (HYPER)SURFACES 

The goal of any classifier is to partition the feature space into regions. The partition is achieved via
points in $\mathbb{R}$, curves in $\mathbb{R}^2$, surfaces in $\mathbb{R}^3$, and hypersurfaces in $\mathbb{R}^l$. Any hypersurface $S$ is expressed in
terms of a function

$$g: \mathbb{R}^l \longmapsto \mathbb{R},$$

and it comprises all the points such that

$$S = \left\{\boldsymbol{x} \in \mathbb{R}^l : g(\boldsymbol{x}) = 0\right\}.$$

Recall that all points lying on one side of this hypersurface score $g(\boldsymbol{x}) > 0$ and all the points on the
other side score $g(\boldsymbol{x}) < 0$. The resulting (hyper)surfaces are known as decision (hyper)surfaces, for
obvious reasons. Take as an example the case of the two-class Bayesian classifier. The respective
decision hypersurface is (implicitly) formed by

$$g(\boldsymbol{x}) := P(\omega_1|\boldsymbol{x}) - P(\omega_2|\boldsymbol{x}) = 0. \tag{7.15}$$

FIGURE 7.2

The class distributions and the resulting threshold values for the two cases of Example 7.1. Note that minimizing the
average risk enlarges the region in which we decide in favor of the most sensitive class, $\omega_1$.


Indeed, we decide in favor of class $\omega_1$ (region $\mathcal{R}_1$) if $\boldsymbol{x}$ falls on the positive side of the hypersurface
defined in (7.15), and in favor of $\omega_2$ for the points falling on the negative side (region $\mathcal{R}_2$). This is
illustrated in Fig. 7.3. At this point, recall the reject option from Remarks 7.1. Points where no decision
is taken are those that lie close to the decision hypersurface.

Once we move away from the Bayesian concept of designing classifiers (as we will soon see, 
and this will be done for a number of reasons), different families of functions for selecting g(x) can 
be adopted and the specific form will be obtained via different optimization criteria, which are not 
necessarily related to the probability of error/average risk. 

In the sequel, we focus on investigating the form that the decision hypersurfaces take for the spe- 
cial case of the Bayesian classifier and where the data in the classes are distributed according to the 
Gaussian PDF. This can provide further insight into the way a classifier partitions the feature space and 
it will also lead to some useful implementations of the Bayesian classifier, under certain scenarios. For 
simplicity, the focus will be on two-class classification tasks, but the results are trivially generalized to 
the more general M-class case. 



FIGURE 7.3

The Bayesian classifier implicitly forms hypersurfaces defined by $g(\boldsymbol{x}) = P(\omega_1|\boldsymbol{x}) - P(\omega_2|\boldsymbol{x}) = 0$.

7.3.1 THE GAUSSIAN DISTRIBUTION CASE

Assume that the data in each class are distributed according to the Gaussian PDF, so that

$$p(\boldsymbol{x}|\omega_i) = \frac{1}{(2\pi)^{l/2}|\Sigma_i|^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_i)\right), \quad i = 1, 2, \ldots, M. \tag{7.16}$$

Because the logarithmic function is a monotonically increasing one, it does not affect the point where
the maximum of a function occurs. Thus, taking into account the exponential form of the Gaussian, the
computations can be facilitated if the Bayesian rule is expressed in terms of the following functions:

$$g_i(\boldsymbol{x}) := \ln\left(p(\boldsymbol{x}|\omega_i)P(\omega_i)\right) = \ln p(\boldsymbol{x}|\omega_i) + \ln P(\omega_i), \quad i = 1, 2, \ldots, M, \tag{7.17}$$

and search for the class for which the respective function scores the maximum value. Such functions
are also known as discriminant functions.

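Let us now focus on the two-class classification task. The decision hypersurface, associated with
the Bayesian classifier, can be expressed as

$$g(\boldsymbol{x}) = g_1(\boldsymbol{x}) - g_2(\boldsymbol{x}) = 0, \tag{7.18}$$

which, after plugging into (7.17) the specific forms of the Gaussian conditionals, and after a bit of
trivial algebra, becomes

$$
\begin{aligned}
g(\boldsymbol{x}) ={}& \underbrace{\frac{1}{2}\left(\boldsymbol{x}^T\Sigma_2^{-1}\boldsymbol{x} - \boldsymbol{x}^T\Sigma_1^{-1}\boldsymbol{x}\right)}_{\text{quadratic terms}} + \underbrace{\boldsymbol{\mu}_1^T\Sigma_1^{-1}\boldsymbol{x} - \boldsymbol{\mu}_2^T\Sigma_2^{-1}\boldsymbol{x}}_{\text{linear terms}} \\
& \underbrace{{}- \frac{1}{2}\boldsymbol{\mu}_1^T\Sigma_1^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\Sigma_2^{-1}\boldsymbol{\mu}_2 + \ln\frac{P(\omega_1)}{P(\omega_2)} + \frac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|}}_{\text{constant terms}} = 0.
\end{aligned} \tag{7.19}
$$

This is of a quadratic nature; hence the corresponding (hyper)surfaces are (hyper)quadrics, including
(hyper)ellipsoids, (hyper)parabolas, and hyperbolas. Fig. 7.4 shows two examples, in the
two-dimensional space, corresponding to $P(\omega_1) = P(\omega_2)$, and

$$\text{(a)} \quad \boldsymbol{\mu}_1 = [0, 0]^T, \quad \boldsymbol{\mu}_2 = [4, 0]^T, \quad \Sigma_1 = \begin{bmatrix} 0.3 & 0.0 \\ 0.0 & 0.35 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.2 & 0.0 \\ 0.0 & 1.85 \end{bmatrix},$$

and

$$\text{(b)} \quad \boldsymbol{\mu}_1 = [0, 0]^T, \quad \boldsymbol{\mu}_2 = [3.2, 0]^T, \quad \Sigma_1 = \begin{bmatrix} 0.1 & 0.0 \\ 0.0 & 0.75 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.75 & 0.0 \\ 0.0 & 0.1 \end{bmatrix},$$

respectively. In Fig. 7.4A, the resulting curve for scenario (a) is an ellipse, and in Fig. 7.4B, the
corresponding curve for scenario (b) is a hyperbola.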
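
The discriminant functions (7.17) and the decision function (7.18) can be evaluated directly. The Python sketch below is restricted to two dimensions so that the inverse and determinant of the covariance matrix can be written in closed form; the function names are assumptions, and the parameter values are those of scenario (a) as read from the text.

```python
import math

def log_gaussian_2d(x, mean, cov):
    """ln of the two-dimensional Gaussian PDF (7.16), with cov = [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det)

def discriminant(x, mean, cov, prior):
    """g_i(x) = ln p(x | w_i) + ln P(w_i), as in (7.17)."""
    return log_gaussian_2d(x, mean, cov) + math.log(prior)

def g(x, class1, class2):
    """Decision function (7.18): positive values favor class 1."""
    return discriminant(x, *class1) - discriminant(x, *class2)

# Scenario (a), equiprobable classes: (mean, covariance, prior) per class.
class1 = ([0.0, 0.0], [[0.3, 0.0], [0.0, 0.35]], 0.5)
class2 = ([4.0, 0.0], [[1.2, 0.0], [0.0, 1.85]], 0.5)

at_mean1 = g([0.0, 0.0], class1, class2)   # inside the region of class 1
at_mean2 = g([4.0, 0.0], class1, class2)   # inside the region of class 2
far_left = g([-6.0, 0.0], class1, class2)  # far from both means
```

Under these parameters, `far_left` comes out negative even though the point lies on the side of $\boldsymbol{\mu}_1$: the region assigned to the class with the smaller covariance is bounded, which is consistent with the elliptic decision curve.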

FIGURE 7.4

The Bayesian classifier for the case of Gaussian distributed classes partitions the feature space via quadrics. (A) The
case of an ellipse and (B) the case of a hyperbola.


Looking carefully at (7.19), it is readily noticed that once the covariance matrices for the two
classes become equal, the quadratic terms cancel out and the discriminant function becomes linear;
thus, the corresponding hypersurface is a hyperplane. That is, under the previous assumptions, the
optimal Bayesian classifier becomes a linear classifier, which after some straightforward algebraic
manipulations (try it) can be written as

$$g(\boldsymbol{x}) = \boldsymbol{\theta}^T(\boldsymbol{x} - \boldsymbol{x}_0) = 0, \tag{7.20}$$

$$\boldsymbol{\theta} := \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \tag{7.21}$$

$$\boldsymbol{x}_0 := \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \ln\frac{P(\omega_1)}{P(\omega_2)}\,\frac{\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2}{\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_{\Sigma^{-1}}^2}, \tag{7.22}$$

where $\Sigma$ is the covariance matrix common to the two classes and

$$\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_{\Sigma^{-1}} := \sqrt{(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}$$

is the $\Sigma^{-1}$ norm of the vector $(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$; alternatively, this is also known as the Mahalanobis distance
between $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$. The Mahalanobis distance is a generalization of the Euclidean distance; note that
for $\Sigma = I$ it becomes the Euclidean distance.

Fig. 7.5 shows three cases for the two-dimensional space. The full black line corresponds to the
case of equiprobable classes with a covariance matrix of the special form, $\Sigma = \sigma^2 I$. The corresponding
decision hyperplane, according to Eq. (7.20), is now written as

$$g(\boldsymbol{x}) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T(\boldsymbol{x} - \boldsymbol{x}_0) = 0. \tag{7.23}$$

The separating line (hyperplane) crosses the middle point of the line segment joining the mean value
points, $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ ($\boldsymbol{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)$). Also, it is perpendicular to this segment, defined by the
vector $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$, as is readily verified by the above hyperplane definition. Indeed, for any point $\boldsymbol{x}$ on the
decision curve (full black line in Fig. 7.5), since the middle point $\boldsymbol{x}_0$ also lies on this curve, the
corresponding vector $\boldsymbol{x} - \boldsymbol{x}_0$ is parallel to this line. Hence, the inner product in Eq. (7.23) being equal
to zero means that the decision line is perpendicular to $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. The red line corresponds to the case
where $P(\omega_1) > P(\omega_2)$. It gets closer to the mean value point of class $\omega_2$, thus enlarging the region
where one decides in favor of the more probable class. Note that in this case, the logarithm of the ratio
in Eq. (7.22) is positive. Finally, the dotted line corresponds to the equiprobable case with the common
covariance matrix being of a more general form, $\Sigma \neq \sigma^2 I$. The separating hyperplane crosses $\boldsymbol{x}_0$
but it is rotated in order to be perpendicular to the vector $\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, according to (7.20)–(7.21).
For each one of the three cases, an unknown point is classified according to the side of the respective
hyperplane on which it lies.

FIGURE 7.5

The full black line corresponds to the Bayesian classifier for two equiprobable Gaussian classes that share a common
covariance matrix of the specific form $\Sigma = \sigma^2 I$; the line bisects the segment joining the two mean values
(minimum Euclidean distance classifier). The red one is for the same case but for $P(\omega_1) > P(\omega_2)$. The dotted line is
the optimal classifier for equiprobable classes and a common covariance of a more general form, different from $\sigma^2 I$
(minimum Mahalanobis distance classifier).

What was said before for the two-class task is generalized to the more general $M$-class problem;
the separating hypersurfaces of two contiguous regions $\mathcal{R}_i$, $\mathcal{R}_j$ associated with classes $\omega_i$, $\omega_j$ obey the
same arguments as the ones adopted before. For example, assuming that all covariance matrices are
the same, the regions are partitioned via hyperplanes, as illustrated in Fig. 7.6. Moreover, each region
$\mathcal{R}_i$, $i = 1, 2, \ldots, M$, is convex (Problem 7.2); in other words, joining any two points within $\mathcal{R}_i$, all the
points lying on the respective segment lie in $\mathcal{R}_i$ as well.

Two special cases are of particular interest, leading to a simple classification rule. The rule will be 
expressed for the general M-class problem. 

Minimum Distance Classifiers 

There are two special cases, where the optimal Bayesian classifier becomes very simple to compute 
and it also has a strong geometric flavor. 

• Minimum Euclidean distance classifier: Under the assumptions of (a) Gaussian distributed data in
each one of the classes, i.e., Eq. (7.16), (b) equiprobable classes, and (c) common covariance matrix
in all classes of the special form $\Sigma = \sigma^2 I$ (individual features are independent and share a common
variance), the Bayesian classification rule becomes equivalent to

$$\text{Assign } \boldsymbol{x} \text{ to class } \omega_i: \quad i = \arg\min_j\,(\boldsymbol{x} - \boldsymbol{\mu}_j)^T(\boldsymbol{x} - \boldsymbol{\mu}_j), \quad j = 1, 2, \ldots, M. \tag{7.24}$$

This is a direct consequence of the Bayesian rule under the adopted assumptions. In other words,
the Euclidean distance of $\boldsymbol{x}$ is computed from the mean values of all classes and it is assigned to the
class for which this distance is smallest.

For the case of the two classes, this classification rule corresponds to the full black line of Fig. 7.5.
Indeed, recalling our geometry basics, any point that lies on the left side of this hyperplane that
bisects the segment $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$ is closer to $\boldsymbol{\mu}_1$ than to $\boldsymbol{\mu}_2$, in the Euclidean distance sense. The opposite
is true for any point lying on the right of the hyperplane.

• Minimum Mahalanobis distance classifier: Under the previously adopted assumptions, but with the
covariance matrix being of the more general form $\Sigma \neq \sigma^2 I$, the rule becomes

$$\text{Assign } \boldsymbol{x} \text{ to class } \omega_i: \quad i = \arg\min_j\,(\boldsymbol{x} - \boldsymbol{\mu}_j)^T\Sigma^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_j), \quad j = 1, 2, \ldots, M. \tag{7.25}$$

Thus, instead of looking for the minimum Euclidean distance, one searches for the minimum Mahalanobis
distance; the latter is a weighted form of the Euclidean distance, in order to account for the
shape of the underlying Gaussian distributions [38]. For the two-class case, this rule corresponds to
the dotted line of Fig. 7.5.

FIGURE 7.6

When data are distributed according to the Gaussian distribution and they share the same covariance matrix in all
classes, the feature space is partitioned via hyperplanes, which form polyhedral regions. Note that each region is
associated with one class and it is convex.

Remarks 7.2. 

• In statistics, adopting the Gaussian assumption for the data distribution is sometimes called linear
discriminant analysis (LDA) or quadratic discriminant analysis (QDA), depending on the
adopted assumptions with respect to the underlying covariance matrices, which will lead to linear
or quadratic discriminant functions, respectively. In practice, the ML method is usually employed
in order to obtain estimates of the unknown parameters, namely, the mean values and the covariance
matrices. Recall from Example 3.7 of Chapter 3 that the ML estimate of the mean value of a
Gaussian PDF, obtained via $N$ observations, $\boldsymbol{x}_n$, $n = 1, 2, \ldots, N$, is equal to

$$\hat{\boldsymbol{\mu}}_{ML} = \frac{1}{N}\sum_{n=1}^{N}\boldsymbol{x}_n.$$

Moreover, the ML estimate of the covariance matrix of a Gaussian distribution, using $N$ observations,
is given by (Problem 7.4)

$$\hat{\Sigma}_{ML} = \frac{1}{N}\sum_{n=1}^{N}(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}}_{ML})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}}_{ML})^T. \tag{7.26}$$

This corresponds to a biased estimator of the covariance matrix. An unbiased estimator results if
(Problem 7.5)

$$\hat{\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N}(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}}_{ML})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}}_{ML})^T.$$

Note that the number of parameters to be estimated in the covariance matrix is $O(l^2/2)$, taking into
account its symmetry.
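
The two covariance estimators differ only in the normalizing factor, as the following Python sketch shows. The function names and the toy observations are assumptions made for illustration.

```python
def mean_vector(points):
    """ML estimate of the mean: the sample average of the observations."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def covariance_matrix(points, unbiased=False):
    """Sample covariance; divides by N (ML, biased) or by N - 1 (unbiased)."""
    n = len(points)
    dim = len(points[0])
    mu = mean_vector(points)
    denom = n - 1 if unbiased else n
    return [
        [sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / denom
         for j in range(dim)]
        for i in range(dim)
    ]

# Hypothetical toy observations in two dimensions.
data = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
mu_ml = mean_vector(data)
cov_ml = covariance_matrix(data)
cov_unbiased = covariance_matrix(data, unbiased=True)
```

The two results differ by the factor $N/(N-1)$, which becomes negligible for large $N$.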


Example 7.2. Consider a two-class classification task in the two-dimensional space, with $P(\omega_1) =
P(\omega_2) = 1/2$. Generate 100 points, 50 from each class. The data from each class, $\omega_i$, $i = 1, 2$, stem
from a corresponding Gaussian, $\mathcal{N}(\boldsymbol{\mu}_i, \Sigma_i)$, where

$$\boldsymbol{\mu}_1 = [0, -2]^T, \quad \boldsymbol{\mu}_2 = [0, 2]^T,$$

and (a)

$$\Sigma_1 = \Sigma_2 = \begin{bmatrix} 1.2 & 0.4 \\ 0.4 & 1.2 \end{bmatrix},$$

or (b)

$$\Sigma_1 = \begin{bmatrix} 1.2 & 0.4 \\ 0.4 & 1.2 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & -0.4 \\ -0.4 & 1 \end{bmatrix}.$$

Fig. 7.7 shows the decision curves formed by the Bayesian classifier. Observe that in the case of
Fig. 7.7A, the classifier turns out to be a linear one, while for the case of Fig. 7.7B, it is nonlinear
of a parabola shape.

FIGURE 7.7

If the data in the feature space follow a Gaussian distribution in each one of the classes, then (A) if all the
covariance matrices are equal, the Bayesian classifier is a hyperplane; (B) otherwise, it is a quadric hypersurface.


Example 7.3. In a two-class classification task, the data in each one of the two equiprobable classes
are distributed according to the Gaussian distribution, with mean values $\boldsymbol{\mu}_1 = [0, 0]^T$ and $\boldsymbol{\mu}_2 = [3, 3]^T$,
respectively, sharing a common covariance matrix

$$\Sigma = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{bmatrix}.$$

Use the Bayesian classifier to classify the point $\boldsymbol{x} = [1.0, 2.2]^T$ into one of the two classes.

Because the classes are equiprobable, are distributed according to the Gaussian distribution, and
share the same covariance matrix, the Bayesian classifier is equivalent to the minimum Mahalanobis
distance classifier. The (square) Mahalanobis distance of the point $\boldsymbol{x}$ from the mean value of class $\omega_1$
is

$$d_1^2 = [1.0, 2.2]\begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}\begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix} = 2.95,$$

where the matrix in the middle on the left-hand side is the inverse of the covariance matrix. Similarly
for class $\omega_2$, we obtain

$$d_2^2 = [-2.0, -0.8]\begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}\begin{bmatrix} -2.0 \\ -0.8 \end{bmatrix} = 3.67.$$

Hence, the pattern is assigned to class $\omega_1$, because its distance from $\boldsymbol{\mu}_1$ is smaller compared to that
from $\boldsymbol{\mu}_2$. Verify that if the Euclidean distance were used instead, the pattern would be assigned to
class $\omega_2$.
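
The numbers of Example 7.3 can be reproduced with a short Python sketch of the minimum distance rules (7.24) and (7.25). The function names are assumptions, and the code is restricted to two dimensions so that the inverse covariance matrix can be applied in closed form.

```python
def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance for 2-D vectors, with cov = [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def euclidean_sq(x, mean):
    """Squared Euclidean distance."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

def nearest_mean(x, means, distance):
    """Index of the class mean at minimum distance from x, as in (7.24) and (7.25)."""
    return min(range(len(means)), key=lambda j: distance(x, means[j]))

means = [[0.0, 0.0], [3.0, 3.0]]
cov = [[1.1, 0.3], [0.3, 1.9]]
x = [1.0, 2.2]

d1 = mahalanobis_sq(x, means[0], cov)  # about 2.95
d2 = mahalanobis_sq(x, means[1], cov)  # about 3.67
by_mahalanobis = nearest_mean(x, means, lambda p, m: mahalanobis_sq(p, m, cov))
by_euclidean = nearest_mean(x, means, euclidean_sq)
```

Here `by_mahalanobis` is 0 (class $\omega_1$) while `by_euclidean` is 1 (class $\omega_2$), since the squared Euclidean distances are 5.84 and 4.64, respectively.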

7.4 THE NAIVE BAYES CLASSIFIER

We have already seen that in the case the covariance matrix is to be estimated, the number of unknown 
parameters is of the order of 0(l 2 / 2). For high-dimensional spaces, besides the fact that this estimation 
task is a formidable one, it also requires a large number of data points, in order to obtain statistically 
good estimates and avoid overfitting, as discussed in Chapter 3. In such cases, one has to be content with 
suboptimal Solutions. Indeed, adopting an optimal method while using bad estimates of the involved 
parameters can lead to a bad overall performance. 

The naive Bayes classifier is a typical and popular example of a suboptimal classifier. The basic 
assumption is that the components (features) in the feature vector are statistically independent; hence, 
the joint PDF can be written as a product of $l$ marginals, 

$$p(\boldsymbol{x}|\omega_i) = \prod_{k=1}^{l} p(x_k|\omega_i), \quad i = 1, 2, \ldots, M.$$

Having adopted the Gaussian assumption, each one of the marginals is described by two parameters, the 
mean and the variance; this leads to a total of $2l$ unknown parameters to be estimated per class. This is 
a substantial saving compared to the $O(l^2/2)$ number of parameters. It turns out that this simplistic 
assumption can end up with better results compared to the optimal Bayes classifier when the number of 
data samples is limited. 
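
A minimal sketch of a Gaussian naive Bayes classifier is given below, assuming ML estimates of the per-feature means and variances and class priors taken as class frequencies. The function names and the toy data are illustrative assumptions, not part of the text.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and the per-feature mean and variance."""
    model = {}
    n_total = len(X)
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n, l = len(rows), len(rows[0])
        means = [sum(r[k] for r in rows) / n for k in range(l)]
        # ML variance estimates; a small floor avoids division by zero.
        variances = [max(sum((r[k] - means[k]) ** 2 for r in rows) / n, 1e-9)
                     for k in range(l)]
        model[c] = (n / n_total, means, variances)
    return model

def log_scores(model, x):
    """Return ln P(omega_i) + sum_k ln p(x_k | omega_i) for every class."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        s = math.log(prior)
        for xk, m, v in zip(x, means, variances):
            s += -0.5 * math.log(2.0 * math.pi * v) - (xk - m) ** 2 / (2.0 * v)
        scores[c] = s
    return scores

def predict(model, x):
    """Assign x to the class with the largest log score."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)

# Toy two-class data in a two-dimensional feature space (l = 2).
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.2], [3.2, -0.1], [2.9, 0.1]]
y = [1, 1, 1, 2, 2, 2]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 0.0]))
```

Each class stores $l$ means and $l$ variances, that is, the $2l$ parameters mentioned above, and working with logarithms turns the product of marginals into a sum.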

Although the naive Bayes classifier was introduced in the context of Gaussian distributed data, its 
use is also justified for the more general case. In Chapter 3, we discussed the curse of dimensionality 
issue and it was stressed that high-dimensional spaces are sparsely populated. In other words, for a fixed 
finite number of data points, N, within a cube of fixed size for each dimension, the larger the dimension 
of the space is, the larger the average distance between any two points becomes. Hence, in order to get 
good estimates of a set of parameters in large spaces, an increased number of data is required. Roughly 
speaking, if $N$ data points are needed in order to get a good enough estimate of a PDF in the real axis, 
$N^l$ data points would be needed for similar accuracy in an $l$-dimensional space. Thus, by assuming 
the features to be mutually independent, one will end up estimating $l$ one-dimensional PDFs, hence 
substantially reducing the need for data. 

The independence assumption is a common one in a number of machine learning and statistics 
tasks. As we will see in Chapter 15, one can adopt more “mild” independence assumptions that lie in 
between the two extremes, which are full independence and full dependence. 


7.5 THE NEAREST NEIGHBOR RULE 

Although the Bayesian rule provides the optimal solution with respect to the classification error proba- 
bility, its application requires the estimation of the respective conditional PDFs; this is not an easy task 
once the dimensionality of the feature space assumes relatively large values. This paves the way for 
considering alternative classification rules, which become our focus from now on. 

The k-nearest neighbor (k-NN) rule is a typical nonparametric classifier and it is one of the most 
popular and well-known classifiers. In spite of its simplicity, it is still in use and stands next to more 
elaborate schemes. 

Consider $N$ training points, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, for an $M$-class classification task. At the 
heart of the method lies a user-defined parameter $k$. Once $k$ is selected, then 
given a pattern $\boldsymbol{x}$, assign it to the class to which the majority of its $k$ nearest (according to a metric, 
e.g., Euclidean or Mahalanobis distance) neighbors, among the training points, belong. The parameter 
$k$ should not be a multiple of $M$, in order to avoid ties. The simplest form of this rule is to assign the 
pattern to the class to which its nearest neighbor belongs, meaning $k = 1$. 

It turns out that this conceptually simple rule tends to the Bayesian classifier if (a) $N \to \infty$, 
(b) $k \to \infty$, and (c) $k/N \to 0$. In practical terms, these conditions mean that $N$ and $k$ must be 
large, yet $k$ must be relatively small with respect to $N$. More specifically, it can be shown that the 
classification errors $P_{NN}$ and $P_{kNN}$ satisfy, asymptotically, the following bounds [9]: 

$$P_B \le P_{NN} \le 2P_B \tag{7.27}$$

for the $k = 1$ NN rule, and 



$$P_B \le P_{kNN} \le P_B + \sqrt{\frac{2P_{NN}}{k}} \tag{7.28}$$


for the more general k-NN version; $P_B$ is the error corresponding to the optimal Bayesian classifier. 
These two formulas are quite interesting. Take for example (7.27). It says that the simple NN rule will, 
asymptotically, never give an error larger than twice the optimal one. If, for example, $P_B = 0.01$, then $P_{NN} \le 0.02$. 
This is not bad for such a simple classifier. All this says is that if one has an easy task (as indicated by 
the very low value of $P_B$), the NN rule can also do a good job. This, of course, is not the case if the 
problem is not an easy one and larger error values are involved. The bound in (7.28) says that for large 
values of $k$ (provided, of course, $N$ is large enough), the performance of the k-NN tends to that of the 
optimal classifier. In practice, one has to make sure that $k$ does not get values close to $N$, but remains 
a relatively small fraction of it. 

One may wonder how a performance close to the optimal classifier can be obtained, even in theory 
and asymptotically, because the Bayesian classifier exploits the statistical information for the data 
distribution while the k-NN does not take into account such information. The reason is that if $N$ is a very 
large value (hence the space is densely populated with points) and $k$ is a relatively small number, with 
respect to $N$, then the nearest neighbors will be located very close to $\boldsymbol{x}$. Then, due to the continuity of 
the involved PDFs, the values of their posterior probabilities will be close to $P(\omega_i|\boldsymbol{x})$, $i = 1, 2, \ldots, M$. 
Furthermore, for large enough $k$, the majority of the neighbors must come from the class that scores 
the maximum value of the posterior probability given $\boldsymbol{x}$. 

A major drawback of the k-NN rule is that every time a new pattern is considered, its distance from 
all the training points has to be computed, and the $k$ points closest to it must then be selected. To this end, various 
searching techniques have been suggested over the years. The interested reader may consult [38] for a 
related discussion. 
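
The rule can be sketched as follows, assuming the Euclidean metric and an exhaustive search over the training set (the costly step noted above). The function name `knn_classify` and the toy data are illustrative assumptions.

```python
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    """Assign x to the majority class among its k nearest training points."""
    # Squared Euclidean distances suffice, since they preserve the ordering.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
        for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-class training set (M = 2); k = 3 is not a multiple of M.
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_classify(train_X, train_y, [0.5, 0.5], k=3))
print(knn_classify(train_X, train_y, [5.2, 5.1], k=3))
print(knn_classify(train_X, train_y, [5.2, 5.1], k=1))
```

Every query computes all $N$ distances and sorts them; the search techniques mentioned above aim at avoiding exactly this exhaustive pass.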

Remarks 7.3. 

• The use of the k-NN concept can also be adopted in the context of the regression task. Given an 
observation $\boldsymbol{x}$, one searches for its $k$ closest input vectors in the training set, denoted as $\boldsymbol{x}_{(1)}, \ldots, \boldsymbol{x}_{(k)}$, 
and computes an estimate of the output value $\hat{y}$ as an average of the respective outputs in the training 
set, represented by 

$$\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}.$$

FIGURE 7.8 

A two-class classification task. The dotted curve corresponds to the optimal Bayesian classifier. The full line curves 
correspond to the (A) 1-NN and (B) 13-NN classifiers. Observe that the 13-NN is closer to the Bayesian one. 
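
The k-NN regression estimate of Remarks 7.3 can be sketched in the same spirit, again assuming the Euclidean metric and an exhaustive search. The function name and the toy data (samples of $y = x^2$) are illustrative assumptions.

```python
def knn_regress(train_X, train_y, x, k):
    """Average the outputs of the k training inputs closest to x."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), i)
        for i, xi in enumerate(train_X)
    )
    nearest_outputs = [train_y[i] for _, i in dists[:k]]
    return sum(nearest_outputs) / k

# One-dimensional inputs with outputs y = x^2.
train_X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0]

# The three inputs closest to 2.1 are 2.0, 3.0, and 1.0.
y_hat = knn_regress(train_X, train_y, [2.1], k=3)
print(y_hat)
```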


Example 7.4. An example that illustrates the decision curves for a two-class classification task in 
the two-dimensional space, obtained by the Bayesian, the 1-NN, and the 13-NN classifier, is given in 
Fig. 7.8. A total of $N = 100$ data points are generated for each class by Gaussian distributions. The decision 
curve of the Bayes classifier has the form of a parabola, while the 1-NN classifier exhibits a highly 
nonlinear nature. The 13-NN rule forms a decision line close to the Bayesian one. 


7.6 LOGISTIC REGRESSION 

In Bayesian classification, the assignment of a pattern to a class is performed based on the posterior 
probabilities, $P(\omega_i|\boldsymbol{x})$. The posteriors are estimated via the respective conditional PDFs, which is not, 
in general, an easy task. The goal in this section is to model the posterior probabilities directly, via 
the logistic regression method. This name has been established in the statistics community, although 
the model refers to classification and not to regression. This is a typical example of the discriminative 
modeling approach, where the distribution of data is not taken into account. 

The two-class case: The starting point is to model the ratio of the posteriors as 

$$\ln\frac{P(\omega_1|\boldsymbol{x})}{P(\omega_2|\boldsymbol{x})} = \boldsymbol{\theta}^T\boldsymbol{x}: \quad \text{two-class logistic regression}, \tag{7.29}$$

where the constant term, $\theta_0$, has been absorbed in $\boldsymbol{\theta}$. Taking into account that 

$$P(\omega_1|\boldsymbol{x}) + P(\omega_2|\boldsymbol{x}) = 1$$

and defining 

$$t := \boldsymbol{\theta}^T\boldsymbol{x},$$

it is readily seen that the model in (7.29) is equivalent to 

$$P(\omega_1|\boldsymbol{x}) = \sigma(t), \tag{7.30}$$

$$\sigma(t) := \frac{1}{1 + \exp(-t)}, \tag{7.31}$$

and 

$$P(\omega_2|\boldsymbol{x}) = 1 - P(\omega_1|\boldsymbol{x}) = \frac{\exp(-t)}{1 + \exp(-t)}. \tag{7.32}$$

The function $\sigma(t)$ is known as the logistic sigmoid or sigmoid link function and it is shown in Fig. 7.9. 

FIGURE 7.9 

The sigmoid link function. 
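
As a small numerical illustration of (7.29)-(7.32), the sketch below evaluates the two posteriors for a given parameter vector. The values of `theta` and `x` are arbitrary assumptions; the leading 1 in `x` plays the role of the absorbed constant term.

```python
import math

def sigmoid(t):
    """Logistic sigmoid (7.31), evaluated without overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def posteriors(theta, x):
    """Return (P(omega_1 | x), P(omega_2 | x)) under the model (7.29)."""
    t = sum(th * xi for th, xi in zip(theta, x))
    p1 = sigmoid(t)
    return p1, 1.0 - p1

theta = [0.5, -1.0, 2.0]   # illustrative values; theta[0] is the bias theta_0
x = [1.0, 2.0, 1.5]        # the leading 1 absorbs the bias term

p1, p2 = posteriors(theta, x)
# The log ratio of the posteriors recovers the linear function theta^T x.
log_ratio = math.log(p1 / p2)
print(p1, p2, log_ratio)
```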

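Although it may sound a bit “mystical” as to how one thought of such a model, it suffices to look 
more carefully at (7.17)-(7.18) to demystify it. Let us assume that the data in a two-class task follow 
Gaussian distributions with $\Sigma_1 = \Sigma_2 = \Sigma$. Under such assumptions, and taking into account the Bayes 
theorem, one can write 

$$\ln\frac{P(\omega_1|\boldsymbol{x})}{P(\omega_2|\boldsymbol{x})} = \ln\frac{p(\boldsymbol{x}|\omega_1)P(\omega_1)}{p(\boldsymbol{x}|\omega_2)P(\omega_2)} \tag{7.33}$$

$$= \ln p(\boldsymbol{x}|\omega_1) + \ln P(\omega_1) - \left(\ln p(\boldsymbol{x}|\omega_2) + \ln P(\omega_2)\right) \tag{7.34}$$

$$= g(\boldsymbol{x}). \tag{7.35}$$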

Furthermore, we know that under the previous assumptions, $g(\boldsymbol{x})$ is given by Eqs. (7.20)-(7.22); hence, 
we can write 

$$\ln\frac{P(\omega_1|\boldsymbol{x})}{P(\omega_2|\boldsymbol{x})} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\Sigma^{-1}\boldsymbol{x} + \text{constants}, \tag{7.36}$$

where “constants” refers to all terms that do not depend on x. In other words, when the distributions that 
describe the data are Gaussians with a common covariance matrix, then the log ratio of the posteriors 
is a linear function. Thus, in logistic regression, all we do is to go one step ahead and adopt such a 
linear model, irrespective of the underlying data distributions. 

Moreover, even if the data are distributed according to Gaussians, it may still be preferable to adopt 
the logistic regression formulation instead of that in (7.36). In the latter formulation, the covariance 
matrix has to be estimated, amounting to $O(l^2/2)$ parameters. The logistic regression formulation only 
involves $l + 1$ parameters. That is, once we know about the linear dependence of the log ratio on $\boldsymbol{x}$, 
we can use this a priori information to simplify the model. Of course, assuming that the Gaussian 
assumption is valid, if one can obtain good estimates of the covariance matrix, employing this extra 
information can lead to more efficient estimates, in the sense of lower variance. The issue is treated in 
[12]. This is natural, because more information concerning the distribution of the data is exploited. In 
practice, it turns out that using the logistic regression is, in general, a safer bet compared to the linear 
discriminant analysis (LDA). 

The parameter vector $\boldsymbol{\theta}$ is estimated via the ML method applied on the set of training samples, 
$(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, $y_n \in \{0, 1\}$. The likelihood function can be written as 

$$P(y_1, \ldots, y_N; \boldsymbol{\theta}) = \prod_{n=1}^{N}\left(\sigma(\boldsymbol{\theta}^T\boldsymbol{x}_n)\right)^{y_n}\left(1 - \sigma(\boldsymbol{\theta}^T\boldsymbol{x}_n)\right)^{1 - y_n}. \tag{7.37}$$

Indeed, if $\boldsymbol{x}_n$ originates from class $\omega_1$, then $y_n = 1$ and the corresponding probability is given by 
$\sigma(\boldsymbol{\theta}^T\boldsymbol{x}_n)$. On the other hand, if $\boldsymbol{x}_n$ comes from $\omega_2$, then $y_n = 0$ and the respective probability is given 
by $1 - \sigma(\boldsymbol{\theta}^T\boldsymbol{x}_n)$. Assuming independence among the successive observations, the likelihood is given 
as the product of the respective probabilities. 

Usually, we consider the negative log-likelihood given by 

$$L(\boldsymbol{\theta}) = -\sum_{n=1}^{N}\left(y_n\ln s_n + (1 - y_n)\ln(1 - s_n)\right), \tag{7.38}$$

where 

$$s_n := \sigma(\boldsymbol{\theta}^T\boldsymbol{x}_n). \tag{7.39}$$

The negative log-likelihood cost function in (7.38) is also known as the cross-entropy error. Minimization of 
$L(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ is carried out iteratively by any iterative minimization scheme, such as the 
gradient descent or Newton’s method. Both schemes need the computation of the respective gradient, 
which in turn is based on the derivative of the sigmoid link function (Problem 7.6) 

$$\frac{d\sigma(t)}{dt} = \sigma(t)\left(1 - \sigma(t)\right). \tag{7.40}$$
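
For completeness, (7.40) follows in one line from the definition (7.31) and from (7.32); the steps below are a sketch of the derivation requested in Problem 7.6.

```latex
\frac{d\sigma(t)}{dt}
  = \frac{d}{dt}\left(1 + \exp(-t)\right)^{-1}
  = \frac{\exp(-t)}{\left(1 + \exp(-t)\right)^{2}}
  = \frac{1}{1 + \exp(-t)} \cdot \frac{\exp(-t)}{1 + \exp(-t)}
  = \sigma(t)\left(1 - \sigma(t)\right).
```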

The gradient is given by (Problem 7.7) 

$$\nabla L(\boldsymbol{\theta}) = \sum_{n=1}^{N}(s_n - y_n)\boldsymbol{x}_n = X^T(\boldsymbol{s} - \boldsymbol{y}), \tag{7.41}$$

where 

$$X^T = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N], \quad \boldsymbol{s} := [s_1, \ldots, s_N]^T, \quad \boldsymbol{y} = [y_1, \ldots, y_N]^T.$$

The Hessian matrix is given by (Problem 7.8) 

$$\nabla^2 L(\boldsymbol{\theta}) = \sum_{n=1}^{N} s_n(1 - s_n)\boldsymbol{x}_n\boldsymbol{x}_n^T = X^T R X, \tag{7.42}$$

where 

$$R := \operatorname{diag}\{s_1(1 - s_1), \ldots, s_N(1 - s_N)\}. \tag{7.43}$$

Note that because $0 < s_n < 1$, by definition of the sigmoid link function, matrix $R$ is positive definite 
(see Appendix A); hence, the Hessian matrix is also positive definite, provided that $X$ has full column rank 
(Problem 7.9). A positive definite Hessian is a sufficient condition for (strict) convexity. Thus, the negative 
log-likelihood function is convex, which guarantees that its minimum, when it exists, is unique (e.g., [1] and Chapter 8). 

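Two of the possible iterative minimization schemes to be used are 

• Gradient descent (Section 5.2) 

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu_i X^T(\boldsymbol{s}^{(i-1)} - \boldsymbol{y}). \tag{7.44}$$

• Newton’s scheme (Section 6.7) 

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu_i\left(X^T R^{(i-1)} X\right)^{-1} X^T(\boldsymbol{s}^{(i-1)} - \boldsymbol{y}) = \left(X^T R^{(i-1)} X\right)^{-1} X^T R^{(i-1)}\boldsymbol{z}^{(i-1)}, \tag{7.45}$$

where the second equality holds for $\mu_i = 1$ and 

$$\boldsymbol{z}^{(i-1)} := X\boldsymbol{\theta}^{(i-1)} - \left(R^{(i-1)}\right)^{-1}(\boldsymbol{s}^{(i-1)} - \boldsymbol{y}). \tag{7.46}$$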

Eq. (7.45) is a weighted version of the LS solution (Chapters 3 and 6); however, the involved quantities 
are iteration dependent and the resulting scheme is known as iterative reweighted least-squares scheme 
(IRLS) [36]. 
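
A minimal sketch of the gradient descent scheme in (7.44) is given below, written in plain Python for a small, non-separable toy data set. The constant step size `mu`, the number of iterations, and the data are assumptions made for illustration only; the leading 1 in each input absorbs the bias term.

```python
import math

def sigmoid(t):
    """Numerically stable logistic sigmoid."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def cross_entropy(theta, X, y):
    """Negative log-likelihood (7.38); eps guards against log(0)."""
    eps = 1e-12
    total = 0.0
    for xn, yn in zip(X, y):
        s = sigmoid(sum(t * v for t, v in zip(theta, xn)))
        total -= yn * math.log(s + eps) + (1 - yn) * math.log(1.0 - s + eps)
    return total

def gradient(theta, X, y):
    """Gradient (7.41): sum over n of (s_n - y_n) x_n."""
    g = [0.0] * len(theta)
    for xn, yn in zip(X, y):
        s = sigmoid(sum(t * v for t, v in zip(theta, xn)))
        for j, v in enumerate(xn):
            g[j] += (s - yn) * v
    return g

def fit_gd(X, y, mu=0.1, iters=2000):
    """Gradient descent iterations (7.44) with a constant step size mu."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(theta, X, y)
        theta = [t - mu * gj for t, gj in zip(theta, g)]
    return theta

# One feature plus a leading 1; the classes overlap, so a finite minimum exists.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.5], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

theta = fit_gd(X, y)
print(theta, cross_entropy(theta, X, y), gradient(theta, X, y))
```

Newton’s scheme in (7.45) would replace the update line by the solution of a small linear system involving $X^T R X$; it typically needs far fewer iterations, at a higher cost per iteration.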


2 Convexity is discussed in more detail in Chapter 8. 

When maximizing the likelihood, we may run into problems if the training data set is linearly separable. 
In this case, any point on a hyperplane, $\boldsymbol{\theta}^T\boldsymbol{x} = 0$, that solves the classification task and separates the 
samples from each class (note that there are infinitely many such hyperplanes) results in $\sigma(\boldsymbol{\theta}^T\boldsymbol{x}) = 0.5$, 
and, as $\|\boldsymbol{\theta}\|$ grows, every training point is assigned a posterior probability for its own class that tends to one. Thus, ML 
forces the logistic sigmoid to become a step function in the feature space and equivalently $\|\boldsymbol{\theta}\| \to \infty$. 
This can lead to overfitting and it is remedied by including a regularization term, e.g., $\|\boldsymbol{\theta}\|^2$, in the 
respective cost function. 
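
The effect can be observed numerically. The sketch below runs gradient descent on a linearly separable toy set, with and without a term $\lambda\|\boldsymbol{\theta}\|^2$ added to the cost; the data, the step size, and the value $\lambda = 0.1$ are assumptions chosen only for illustration.

```python
import math

def sigmoid(t):
    """Numerically stable logistic sigmoid."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit(X, y, lam, iters, mu=0.1):
    """Gradient descent on L(theta) + lam * ||theta||^2."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        g = [2.0 * lam * t for t in theta]   # gradient of the regularizer
        for xn, yn in zip(X, y):
            s = sigmoid(sum(t * v for t, v in zip(theta, xn)))
            for j, v in enumerate(xn):
                g[j] += (s - yn) * v         # cross-entropy gradient (7.41)
        theta = [t - mu * gj for t, gj in zip(theta, g)]
    return theta

def norm(theta):
    return math.sqrt(sum(t * t for t in theta))

# Linearly separable data: class 0 for negative inputs, class 1 for positive.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

norm_200 = norm(fit(X, y, lam=0.0, iters=200))
norm_20000 = norm(fit(X, y, lam=0.0, iters=20000))
norm_reg_2000 = norm(fit(X, y, lam=0.1, iters=2000))
norm_reg_20000 = norm(fit(X, y, lam=0.1, iters=20000))

print("no regularization:", norm_200, norm_20000)
print("regularized:      ", norm_reg_2000, norm_reg_20000)
```

Without regularization the norm keeps increasing with the number of iterations, whereas the regularized cost has a finite minimizer and the iterations settle.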

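The M-class case: For the more general $M$-class classification task, the logistic regression is defined 
for $m = 1, 2, \ldots, M$ as 

$$P(\omega_m|\boldsymbol{x}) = \frac{\exp(\boldsymbol{\theta}_m^T\boldsymbol{x})}{\sum_{j=1}^{M}\exp(\boldsymbol{\theta}_j^T\boldsymbol{x})}: \quad \text{multiclass logistic regression}. \tag{7.47}$$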


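The previous definition is easily brought into the form of a linear model for the log ratio of the 
posteriors. Divide, for example, by $P(\omega_M|\boldsymbol{x})$ to obtain 

$$\ln\frac{P(\omega_m|\boldsymbol{x})}{P(\omega_M|\boldsymbol{x})} = (\boldsymbol{\theta}_m - \boldsymbol{\theta}_M)^T\boldsymbol{x} = \hat{\boldsymbol{\theta}}_m^T\boldsymbol{x}.$$

Let us define, for notational convenience, 

$$\phi_{nm} := P(\omega_m|\boldsymbol{x}_n), \quad n = 1, 2, \ldots, N, \quad m = 1, 2, \ldots, M,$$

and 

$$t_m := \boldsymbol{\theta}_m^T\boldsymbol{x}, \quad m = 1, 2, \ldots, M.$$

The likelihood function is now written as 

$$P(\boldsymbol{y}; \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M) = \prod_{n=1}^{N}\prod_{m=1}^{M}(\phi_{nm})^{y_{nm}}, \tag{7.48}$$

where $y_{nm} = 1$ if $\boldsymbol{x}_n \in \omega_m$ and zero otherwise. The respective negative log-likelihood function becomes 

$$L(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M) = -\sum_{n=1}^{N}\sum_{m=1}^{M} y_{nm}\ln\phi_{nm}, \tag{7.49}$$

which is the generalization of the cross-entropy cost function for the case of $M$ classes. Minimization 
with respect to $\boldsymbol{\theta}_m$, $m = 1, \ldots, M$, takes place iteratively. To this end, the following gradients are used 
(Problems 7.10-7.12): 

$$\frac{\partial\phi_{nm}}{\partial t_j} = \phi_{nm}(\delta_{mj} - \phi_{nj}), \tag{7.50}$$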
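
where $\delta_{mj}$ is one if $m = j$ and zero otherwise. Also, 

$$\nabla_{\boldsymbol{\theta}_j} L(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M) = \sum_{n=1}^{N}(\phi_{nj} - y_{nj})\boldsymbol{x}_n. \tag{7.51}$$

The respective Hessian matrix is an $(lM) \times (lM)$ matrix, comprising $l \times l$ blocks. Its $k, j$ block 
element is given by 

$$\nabla_{\boldsymbol{\theta}_k}\nabla_{\boldsymbol{\theta}_j} L(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M) = \sum_{n=1}^{N}\phi_{nj}(\delta_{kj} - \phi_{nk})\boldsymbol{x}_n\boldsymbol{x}_n^T. \tag{7.52}$$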

The Hessian matrix is also positive definite, which guarantees uniqueness of the minimum as in the 
two-class case. 
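
The multiclass posteriors in (7.47) and the gradient in (7.51) can be sketched as follows; a finite-difference comparison serves as a sanity check of the gradient formula. The data, the parameter values, and the function names are illustrative assumptions.

```python
import math

def softmax_posteriors(thetas, x):
    """Posteriors P(omega_m | x) of (7.47); the max is subtracted for stability."""
    t = [sum(a * b for a, b in zip(th, x)) for th in thetas]
    t_max = max(t)
    e = [math.exp(v - t_max) for v in t]
    z = sum(e)
    return [v / z for v in e]

def neg_log_likelihood(thetas, X, Y):
    """Cross-entropy cost (7.49); Y holds the one-hot rows y_nm."""
    total = 0.0
    for xn, yn in zip(X, Y):
        phi = softmax_posteriors(thetas, xn)
        total -= sum(ynm * math.log(p) for ynm, p in zip(yn, phi) if ynm)
    return total

def gradients(thetas, X, Y):
    """Gradient with respect to each theta_j, following (7.51)."""
    M, l = len(thetas), len(thetas[0])
    grads = [[0.0] * l for _ in range(M)]
    for xn, yn in zip(X, Y):
        phi = softmax_posteriors(thetas, xn)
        for j in range(M):
            for d in range(l):
                grads[j][d] += (phi[j] - yn[j]) * xn[d]
    return grads

# Three classes, inputs with a leading 1 for the bias, one-hot labels.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.2], [1.0, 0.1, 1.0], [1.0, 0.9, 0.8]]
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
thetas = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [-0.3, 0.2, 0.4]]

grads = gradients(thetas, X, Y)

# Central finite difference for one component, as a check of (7.51).
h, j, d = 1e-6, 1, 2
plus = [row[:] for row in thetas]
minus = [row[:] for row in thetas]
plus[j][d] += h
minus[j][d] -= h
numeric = (neg_log_likelihood(plus, X, Y) - neg_log_likelihood(minus, X, Y)) / (2 * h)
print(grads[j][d], numeric)
```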

Remarks 7.4. 

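• Probit regression: Instead of using the logistic sigmoid function in (7.30) (for the two-class case), 
other functions can also be adopted. A popular function in the statistical community is the probit 
function, which is defined as 

$$\Phi(t) := \int_{-\infty}^{t}\mathcal{N}(z|0, 1)\,dz = \frac{1}{2}\left(1 + \frac{1}{\sqrt{2}}\operatorname{erf}(t)\right), \tag{7.53}$$

where erf is the error function defined as 

$$\operatorname{erf}(t) = \frac{2}{\sqrt{\pi}}\int_{0}^{t}\exp\left(-\frac{z^2}{2}\right)dz.$$

In other words, $P(\omega_1|t)$ is modeled to be equal to the probability of a normalized Gaussian variable 
to lie in the interval $(-\infty, t]$. Note that many references define the error function with $\exp(-z^2)$ in the 
integrand, in which case $\Phi(t) = \frac{1}{2}\left(1 + \operatorname{erf}(t/\sqrt{2})\right)$. The graph of the probit function is very similar 
to that of the logistic one. 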
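
The similarity of the two link functions can be checked numerically. Python's `math.erf` follows the common convention with $\exp(-z^2)$ in the integrand, so the Gaussian cumulative distribution is obtained as $\frac{1}{2}(1 + \operatorname{erf}(t/\sqrt{2}))$; the evaluation points below are arbitrary.

```python
import math

def probit(t):
    """CDF of a zero-mean, unit-variance Gaussian, via the standard erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def sigmoid(t):
    """Logistic sigmoid (7.31)."""
    return 1.0 / (1.0 + math.exp(-t))

# Both functions rise monotonically from 0 to 1 and equal 0.5 at t = 0.
for t in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(t, probit(t), sigmoid(t))
```

The probit function approaches its limits 0 and 1 faster than the logistic sigmoid, since the Gaussian tails decay faster than exponentially.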


7.7 FISHER’S LINEAR DISCRIMINANT 

We now turn our focus to designing linear classifiers. In other words, irrespective of the data 
distribution in each class, we decide to partition the space in terms of hyperplanes, so that 

$$g(\boldsymbol{x}) = \boldsymbol{\theta}^T\boldsymbol{x} + \theta_0 = 0. \tag{7.54}$$

We have dealt with the task of designing linear classifiers in the framework of the LS method in 
Chapter 3. In this section, the unknown parameter vector will be estimated via a path that exploits a 
number of important notions relevant to classification. The method is known as Fisher’s discriminant 
and it can be dressed up with different interpretations. 
Thus, its significance lies not only in its practical use but also in its pedagogical value. Prior to 
presenting the method, let us first discuss some related issues concerning the selection of features that 
describe the input patterns and some associated measures that can quantify the “goodness” of a selected 
set of features. 

7.7.1 SCATTER MATRICES 

Two of the major phases in designing a pattern recognition system are the feature generation and feature 
selection phases. Selecting information-rich features is of paramount importance. If “bad” features 
are selected, whatever smart classifier one adopts, the performance is bound to be poor. Feature 
generation/selection techniques are treated in detail in, e.g., [38,39], to which the interested reader may refer 
for further information. At this point, we only touch on a few notions that are relevant to our current 
design of a linear classifier. Let us first quantify what a “bad” and a “good” feature is. The main goal 
in selecting features, and, thus, in selecting the feature space in which one is going to work, can be 
summarized in the following way. Select the features to create a feature space in which the points, 
which represent the training patterns, are distributed so as to have 


large between-class distance 
and 

small within-class variance. 


Fig. 7.10 illustrates three different possible choices for the case of two-dimensional feature spaces. 
Each point corresponds to a different input pattern and each figure corresponds to a different choice 
of the pair of features; that is, each figure shows the distribution of the input patterns in the respective 
feature space. Common sense dictates that selecting as features to represent the input patterns those 
associated with Fig. 7.10C is the best one; the points in the three classes form groups that lie relatively 
far away from each other, and at the same time the data in each class are compactly clustered together. 
The worst of the three choices is that of Fig. 7.10B, where data in each class are spread around their 
mean value and the three groups are relatively close to each other. The goal in feature selection is to 
develop measures that quantify the “slogan” given in the box above. The notion of scatter matrices is 
of relevance to us here. 

FIGURE 7.10 

Three different choices of two-dimensional feature spaces: (A) small within-class variance and small between-class 
distance; (B) large within-class variance and small between-class distance; and (C) small within-class variance and 
large between-class distance. The last one is the best choice out of the three. 

• Within-class scatter matrix: 

$$\Sigma_w = \sum_{k=1}^{M} P(\omega_k)\Sigma_k, \tag{7.55}$$

where $\Sigma_k$ is the covariance matrix of the points in the $k$th among $M$ classes. In words, $\Sigma_w$ is the 
average covariance matrix of the data in the specific $l$-dimensional feature space. 

• Between-class scatter matrix: 

$$\Sigma_b = \sum_{k=1}^{M} P(\omega_k)(\boldsymbol{\mu}_k - \boldsymbol{\mu}_0)(\boldsymbol{\mu}_k - \boldsymbol{\mu}_0)^T, \tag{7.56}$$

where $\boldsymbol{\mu}_0$ is the overall mean value defined by 

$$\boldsymbol{\mu}_0 = \sum_{k=1}^{M} P(\omega_k)\boldsymbol{\mu}_k. \tag{7.57}$$

Another commonly used related matrix is the following. 

• Mixture scatter matrix: 

$$\Sigma_m = \Sigma_w + \Sigma_b. \tag{7.58}$$

A number of criteria that measure the “goodness” of the selected feature space are built around these 
scatter matrices; three typical examples are (e.g., [17,38]) 

$$J_1 = \frac{\operatorname{trace}\{\Sigma_m\}}{\operatorname{trace}\{\Sigma_w\}}, \quad J_2 = \frac{|\Sigma_m|}{|\Sigma_w|}, \quad J_3 = \operatorname{trace}\{\Sigma_w^{-1}\Sigma_b\}, \tag{7.59}$$

where $|\cdot|$ denotes the determinant of a matrix. 

The $J_1$ criterion is the easiest to understand. To simplify the arguments, let us focus on the 
two-dimensional (two features, $x_1$, $x_2$) case involving three classes. Recall from the definition of the covariance 
matrix of a random vector $\boldsymbol{x}$ in Eq. (2.31) of Chapter 2 that the elements across the main diagonal 
are the variances of the respective random variables. Thus, for each class, $k = 1, 2, 3$, the trace of the 
corresponding covariance matrix will be the sum of the variances for each one of the two features, i.e., 

$$\operatorname{trace}\{\Sigma_k\} = \sigma_{k1}^2 + \sigma_{k2}^2.$$

Hence, the trace of $\Sigma_w$ is the average total variance of the two features over all three classes, 

$$\operatorname{trace}\{\Sigma_w\} = \sum_{k=1}^{3} P(\omega_k)\left(\sigma_{k1}^2 + \sigma_{k2}^2\right) := s_w.$$

On the other hand, the trace of $\Sigma_b$ is equal to the average, over all classes, of the total squared Euclidean 
distance of each individual feature mean value from the corresponding global one, i.e., 

$$\operatorname{trace}\{\Sigma_b\} = \sum_{k=1}^{3} P(\omega_k)\left((\mu_{k1} - \mu_{01})^2 + (\mu_{k2} - \mu_{02})^2\right) \tag{7.60}$$

$$= \sum_{k=1}^{3} P(\omega_k)\|\boldsymbol{\mu}_k - \boldsymbol{\mu}_0\|^2 := s_b, \tag{7.61}$$

where $\boldsymbol{\mu}_k = [\mu_{k1}, \mu_{k2}]^T$ is the mean value for the $k$th class and $\boldsymbol{\mu}_0 = [\mu_{01}, \mu_{02}]^T$ is the global mean 
vector.³ Thus, the $J_1$ criterion is equal to 

$$J_1 = \frac{s_w + s_b}{s_w} = 1 + \frac{s_b}{s_w}.$$

In other words, the smaller the average total variance is and the larger the average squared Euclidean 
distance of the mean values from the global one gets, the larger the value of $J_1$ becomes. Similar 
arguments can be made for the other two criteria. 
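
The $J_1$ criterion can be computed directly from labeled data. The sketch below assumes that the priors are estimated as class frequencies and the covariances via ML estimates; the synthetic data (three classes placed around fixed centers, with a spread controlled by `r`) and all function names are illustrative assumptions.

```python
def mean(rows):
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]

def covariance(rows, mu):
    """ML estimate of the covariance matrix of the given rows."""
    n, l = len(rows), len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / n
             for b in range(l)] for a in range(l)]

def scatter_matrices(X, y):
    """Return the within-class and between-class scatter matrices (7.55), (7.56)."""
    l = len(X[0])
    classes = sorted(set(y))
    groups = {c: [x for x, lab in zip(X, y) if lab == c] for c in classes}
    priors = {c: len(groups[c]) / len(X) for c in classes}
    mus = {c: mean(groups[c]) for c in classes}
    mu0 = [sum(priors[c] * mus[c][d] for c in classes) for d in range(l)]
    Sw = [[0.0] * l for _ in range(l)]
    Sb = [[0.0] * l for _ in range(l)]
    for c in classes:
        Sk = covariance(groups[c], mus[c])
        for a in range(l):
            for b in range(l):
                Sw[a][b] += priors[c] * Sk[a][b]
                Sb[a][b] += priors[c] * (mus[c][a] - mu0[a]) * (mus[c][b] - mu0[b])
    return Sw, Sb

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def j1(X, y):
    """J1 = trace(Sigma_m) / trace(Sigma_w), with Sigma_m = Sigma_w + Sigma_b."""
    Sw, Sb = scatter_matrices(X, y)
    return (trace(Sw) + trace(Sb)) / trace(Sw)

def make_data(r):
    """Three classes; four points at distance r around each class center."""
    centers = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    X, y = [], []
    for k, (cx, cy) in enumerate(centers, start=1):
        for ox, oy in offsets:
            X.append([cx + r * ox, cy + r * oy])
            y.append(k)
    return X, y

j1_compact = j1(*make_data(0.5))   # small within-class variance
j1_spread = j1(*make_data(2.0))    # large within-class variance, same means
print(j1_compact, j1_spread)
```

Both data sets share the same class means, so $s_b$ is identical; only $s_w$ changes, and the compact configuration yields the larger $J_1$.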


7.7.2 FISHER’S DISCRIMINANT: THE TWO-CLASS CASE 


In Fisher’s linear discriminant analysis, the emphasis in Eq. (7.54) is only on $\boldsymbol{\theta}$; the bias term $\theta_0$ is left 
out of the discussion. The inner product $\boldsymbol{\theta}^T\boldsymbol{x}$ can be viewed as the projection of $\boldsymbol{x}$ along the vector $\boldsymbol{\theta}$. 
Strictly speaking, we know from geometry that the respective projection is also a vector, $\boldsymbol{y}$, given by 
(e.g., Section 5.6) 

$$\boldsymbol{y} = \frac{\boldsymbol{\theta}^T\boldsymbol{x}}{\|\boldsymbol{\theta}\|}\frac{\boldsymbol{\theta}}{\|\boldsymbol{\theta}\|},$$

where $\boldsymbol{\theta}/\|\boldsymbol{\theta}\|$ is the unit norm vector in the direction of $\boldsymbol{\theta}$. From now on, we will focus on the scalar 
value of the projection, $y := \boldsymbol{\theta}^T\boldsymbol{x}$, and ignore the scaling factor in the denominator, because scaling all 
features by the same value has no effect on our discussion. The goal, now, is to select that direction $\boldsymbol{\theta}$ so 
that after projecting along this direction, (a) the data in the two classes are as far away as possible from 
each other, and (b) the respective variances of the points around their means, in each one of the classes, 
are as small as possible. A criterion that quantifies the aforementioned goal is Fisher’s discriminant 
ratio (FDR), defined as 

$$\text{FDR} = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}: \quad \text{Fisher’s discriminant ratio}, \tag{7.62}$$

where $\mu_1$ and $\mu_2$ are the (scalar) mean values of the two classes after the projection along $\boldsymbol{\theta}$, meaning 

$$\mu_k = \boldsymbol{\theta}^T\boldsymbol{\mu}_k, \quad k = 1, 2.$$


3 Note that the last equation can also be derived by using the property trace{A} = trace{A^T}, applied on $\Sigma_b$. 










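However, we have 

$$(\mu_1 - \mu_2)^2 = \boldsymbol{\theta}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{\theta} = \boldsymbol{\theta}^T S_b\boldsymbol{\theta}, \quad S_b := (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T. \tag{7.63}$$

Note that if the classes are equiprobable, $S_b$ is a scaled version of the between-class scatter matrix in 
(7.56) (this is easily checked, since under this assumption, $\boldsymbol{\mu}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)$), and we have 

$$(\mu_1 - \mu_2)^2 \propto \boldsymbol{\theta}^T\Sigma_b\boldsymbol{\theta}. \tag{7.64}$$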


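Moreover, 

$$\sigma_k^2 = \mathbb{E}\left[(y - \mu_k)^2\right] = \mathbb{E}\left[\boldsymbol{\theta}^T(\boldsymbol{x} - \boldsymbol{\mu}_k)(\boldsymbol{x} - \boldsymbol{\mu}_k)^T\boldsymbol{\theta}\right] = \boldsymbol{\theta}^T\Sigma_k\boldsymbol{\theta}, \quad k = 1, 2, \tag{7.65}$$

which leads to 

$$\sigma_1^2 + \sigma_2^2 = \boldsymbol{\theta}^T S_w\boldsymbol{\theta},$$

where $S_w = \Sigma_1 + \Sigma_2$. Note that if the classes are equiprobable, $S_w$ becomes a scaled version of the 
within-class scatter matrix defined in (7.55), and we have 

$$\sigma_1^2 + \sigma_2^2 \propto \boldsymbol{\theta}^T\Sigma_w\boldsymbol{\theta}. \tag{7.66}$$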

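Combining (7.62), (7.64), and (7.66) and neglecting the proportionality constants, we end up with 

$$\text{FDR} = \frac{\boldsymbol{\theta}^T\Sigma_b\boldsymbol{\theta}}{\boldsymbol{\theta}^T\Sigma_w\boldsymbol{\theta}}: \quad \text{generalized Rayleigh quotient}. \tag{7.67}$$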


Our goal now becomes that of maximizing the FDR with respect to $\boldsymbol{\theta}$. This is a case of the generalized 
Rayleigh ratio, and it is known from linear algebra that it is maximized if $\boldsymbol{\theta}$ satisfies 

$$\Sigma_b\boldsymbol{\theta} = \lambda\Sigma_w\boldsymbol{\theta},$$

where $\lambda$ is the maximum eigenvalue of the matrix $\Sigma_w^{-1}\Sigma_b$ (Problem 7.14). However, for our specific 
case⁴ here, we can bypass the need for solving an eigenvalue-eigenvector problem. Taking into account 
that $\Sigma_b$ is a scaled version of $S_b$ in Eq. (7.63), the last equation can be rewritten as 

$$\lambda\Sigma_w\boldsymbol{\theta} \propto (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{\theta} \propto (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),$$

since the inner product, $(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{\theta}$, is a scalar. In other words, $\Sigma_w\boldsymbol{\theta}$ lies in the direction of $(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, 
and because we are only interested in the direction, we can finally write 

$$\boldsymbol{\theta} = \Sigma_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \tag{7.68}$$

assuming of course that $\Sigma_w$ is invertible. In practice, $\Sigma_w$ is obtained as the respective sample mean 
using the available observations. 

4 $\Sigma_b$ is a rank one matrix and, hence, there is only one nonzero eigenvalue; see also Problem 7.15. 

FIGURE 7.11 

(A) The optimal direction resulting from Fisher’s discriminant for two spherically distributed classes. The direction 
on which projection takes place is parallel to the segment joining the mean values of the data in the two classes. 
(B) The line on the bottom left of the figure corresponds to the direction that results from Fisher’s discriminant; 
observe that it is no longer parallel to $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. For the sake of comparison, observe that projecting on the other line 
on the right results in class overlap. 

Fig. 7.11A shows the resulting direction for two spherically distributed (isotropic) classes in the 
two-dimensional space. In this case, the direction for projecting the data is parallel to $(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$. In 
Fig. 7.11B, the distribution of the data in the two classes is not spherical, and the direction of projection 
(the line to the bottom left of the figure) is not parallel to the segment joining the two mean points. 
Observe that if the line to the right is selected, then after projection the classes do overlap. 
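
The two situations of Fig. 7.11 can be mimicked numerically for the two-dimensional case. The sketch below computes the direction $\Sigma_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ with a closed-form 2x2 solve and compares the resulting FDR with that of the direction joining the means. The mean vectors and the two scatter matrices are arbitrary assumptions, not the data of the figure.

```python
def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix A via Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def fisher_direction(Sw, mu1, mu2):
    """Direction Sw^{-1} (mu1 - mu2); only its orientation matters."""
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    return solve_2x2(Sw, diff)

def fdr(theta, Sw, mu1, mu2):
    """Ratio (theta^T (mu1 - mu2))^2 / (theta^T Sw theta)."""
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    num = (theta[0] * diff[0] + theta[1] * diff[1]) ** 2
    Sw_theta = [Sw[0][0] * theta[0] + Sw[0][1] * theta[1],
                Sw[1][0] * theta[0] + Sw[1][1] * theta[1]]
    den = theta[0] * Sw_theta[0] + theta[1] * Sw_theta[1]
    return num / den

def cross(u, v):
    """Zero if and only if the 2-D vectors u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]

mu1, mu2 = [2.0, 1.0], [0.0, 0.0]
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]

Sw_iso = [[1.0, 0.0], [0.0, 1.0]]      # spherical (isotropic) classes
Sw_corr = [[2.0, 1.5], [1.5, 2.0]]     # correlated, non-spherical classes

theta_iso = fisher_direction(Sw_iso, mu1, mu2)
theta_corr = fisher_direction(Sw_corr, mu1, mu2)

print(theta_iso, cross(theta_iso, diff))
print(theta_corr, cross(theta_corr, diff))
print(fdr(theta_corr, Sw_corr, mu1, mu2), fdr(diff, Sw_corr, mu1, mu2))
```

In the isotropic case the direction coincides with the segment joining the means; in the correlated case it rotates away from it and attains a larger FDR than the mean-difference direction.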

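In order for the Fisher discriminant method to be used as a classifier, a threshold $\theta_0$ must be adopted, 
and decision in favor of a class is performed according to the rule 

$$y = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\Sigma_w^{-1}\boldsymbol{x} + \theta_0 \begin{cases} > 0, & \text{class } \omega_1, \\ < 0, & \text{class } \omega_2. \end{cases} \tag{7.69}$$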


Compare, now, (7.69) with (7.20)-(7.22); the latter were obtained via the Bayes rule for the Gaus- 
sian case, when both classes share the same covariance matrix. Observe that for this case, the resulting 
hyperplanes for both methods are parallel and the only difference is in the threshold value. Note, how- 
ever, that the Gaussian assumption was not needed for the Fisher discriminant. This justifies the use 
of (7.20)-(7.22), even when the data are not normally distributed. In practice, depending on the data, 
different threshold values may be used. 

Finally, because the world is often small, it can be shown that Fisher’s discriminant can also be seen 
as a special case of the LS solution if the target class labels, instead of $\pm 1$, are chosen as $\frac{N}{N_1}$ and $-\frac{N}{N_2}$, 
respectively, where $N$ is the total number of training samples, $N_1$ is the number of samples in class $\omega_1$, 
and $N_2$ is the corresponding number in class $\omega_2$, e.g., [41]. 

Another point of view for Fisher’s discriminant method is that it performs dimensionality reduction 
by projecting the data from the original $l$-dimensional space to a lower one-dimensional space. This 
reduction in dimensionality is performed in a supervised way, by exploiting the class labels of the 
training data. As we will see in Chapter 19, there are other techniques in which the dimensionality 
reduction takes place in an unsupervised way. The obvious question now is whether it is possible to 
use Fisher’s idea to reduce the dimensionality, not to one, but to another intermediate value between 
one and $l$, where $l$ is the dimensionality of the feature space. It turns out that this is possible, but it 
also depends on the number of classes. More on dimensionality reduction techniques can be found in 
Chapter 19. 


7.7.3 FISHER’S DISCRIMINANT: THE MULTICLASS CASE 


Our starting point for generalizing to the multiclass case is the $J_3$ criterion defined in (7.59). It can be 
readily shown that the FDR criterion, used in the two-class case, is directly related to the $J_3$ one, once 
the latter is considered for the one-dimensional case and for equiprobable classes. For the more general 
multiclass formulation, the task becomes that of estimating an $l \times m$, $m < l$, matrix, $A$, such that the 
linear transformation from the original $\mathbb{R}^l$ to the new, lower-dimensional, $\mathbb{R}^m$ space, expressed as 

$$\boldsymbol{y} = A^T\boldsymbol{x}, \tag{7.70}$$

retains as much classification-related information as possible. Note that in any dimensionality reduction 
technique, some of the original information is, in general, bound to be lost. Our goal is for the loss to 
be as small as possible. Because we chose to quantify classification-related information by the $J_3$ 
criterion, the goal is to compute $A$ in order to maximize 

$$J_3(A) = \operatorname{trace}\{\Sigma_{wy}^{-1}\Sigma_{by}\}, \tag{7.71}$$

where $\Sigma_{wy}$ and $\Sigma_{by}$ are the within-class and between-class scatter matrices measured in the 
transformed lower-dimensional space. Maximization follows standard arguments of optimization with 
respect to matrices. The algebra gets a bit involved and we will state the final result. Details of the proof 
can be found in, e.g., [17,38]. Matrix $A$ is given by the following equation: 

$$(\Sigma_{wx}^{-1}\Sigma_{bx})A = A\Lambda. \tag{7.72}$$

Matrix $\Lambda$ is a diagonal matrix having as elements $m$ of the eigenvalues (Appendix A) of the $l \times l$ 
matrix, $\Sigma_{wx}^{-1}\Sigma_{bx}$, where $\Sigma_{wx}$ and $\Sigma_{bx}$ are the within-class and between-class scatter matrices, 
respectively, in the original $\mathbb{R}^l$ space. The matrix of interest, $A$, comprises columns that are the respective 
eigenvectors. The problem, now, becomes to select the $m$ eigenvalues/eigenvectors. Note that by its 
definition, $\Sigma_b$, being the sum of $M$ related (via $\boldsymbol{\mu}_0$) rank one matrices, is of rank $M - 1$ (Problem 7.15). 
Thus, the product $\Sigma_{wx}^{-1}\Sigma_{bx}$ has only $M - 1$ nonzero eigenvalues. This imposes a stringent constraint 
on the dimensionality reduction. The maximum dimension $m$ that one can obtain is $m = M - 1$ (for 
the two-class task, $m = 1$), irrespective of the original dimension $l$. There are two cases that are worth 
focusing on: 

• $m = M - 1$. In this case, it is shown that if $A$ is formed having as columns all the eigenvectors 
corresponding to the nonzero eigenvalues, then 

$$J_{3y} = J_{3x}.$$

In other words, there is no loss of information (as measured via the $J_3$ criterion) by reducing the 
dimension from $l$ to $M - 1$! Note that in this case, Fisher’s method produces $m = M - 1$ discriminant 
(linear) functions. This complies with a general result in classification stating that the minimum 
number of discriminant functions needed for an $M$-class classification problem is $M - 1$ [38]. Recall that 
in Bayesian classification, we need $M$ functions, $P(\omega_i|\boldsymbol{x})$, $i = 1, 2, \ldots, M$; however, only $M - 1$ 
of those are independent, because they must all add to one. Hence, Fisher’s method provides the 
minimum number of linear discriminants required. 

• $m < M - 1$. If $A$ is built having as columns the eigenvectors corresponding to the maximum $m$ 
eigenvalues, then 

$$J_{3y} < J_{3x}.$$

However, the resulting value $J_{3y}$ is the maximum possible one. 

Remarks 7.5. 

• If $J_3$ is used with other matrix combinations, as might be achieved by using $\Sigma_m$ in place of $\Sigma_b$, the 
constraint of the rank being equal to $M - 1$ is removed, and larger values for $m$ can be obtained. 

• In a number of practical cases, $\Sigma_w$ may not be invertible. This is, for example, the case in the 
small sample size problems, where the dimensionality of the feature space, $l$, may be larger than 
the number of the training data, $N$. Such problems may be encountered in applications such as web 
document classification, gene expression profiling, and face recognition. There are different escape 
routes in this problem; see [38] for a discussion and related references. 


7.8 CLASSIFICATION TREES 

Classification trees are based on a simple yet powerful idea, and they are among the most popular 
techniques for classification. They are multistage systems, and classification of a pattern into a class 
is achieved sequentially. Through a series of tests, classes are rejected in a sequential fashion until 
a decision is finally reached in favor of one remaining class. Each one of the tests, whose outcome 
decides which classes are rejected, is of a binary “Yes” or “No” type and is applied to a single feature. 
Our goal is to present the main philosophy around a special type of trees known as ordinary binary 
classification trees (OBCTs). They belong to a more general class of methods that construct trees,
both for classification and for regression, known as classification and regression trees (CARTs) [2,31].
Variants of the method have also been proposed [35]. 

The basic idea around OBCTs is to partition the feature space into (hyper)rectangles; that is, the
space is partitioned via hyperplanes, which are parallel to the axes. This is illustrated in Fig. 7.12.
The partition of the space in (hyper)rectangles is performed via a series of “questions” of this form: is
the value of the feature $x_i < a$? This is also known as the splitting criterion. The sequence of questions
can nicely be realized via the use of a tree. Fig. 7.13 shows the tree corresponding to the case illustrated 
in Fig. 7.12. Each node of the tree performs a test against an individual feature, and if it is not a leaf 
node, it is connected to two descendant nodes: one is associated with the answer “Yes” and the other 
with the answer “No.” 

Starting from the root node, a path of successive decisions is realized until a leaf node is reached. 
Each leaf node is associated with a single class. The assignment of a point to a class is done according 
to the label of the respective leaf node. This type of classification is conceptually simple and easily
interpretable. For example, in a medical diagnosis system, one may start with the question, is the 
temperature high? If yes, a second question can be: is the nose runny? The process carries on until a 
final decision concerning the disease has been reached. Also, trees are useful in building up reasoning 
systems in artificial intelligence [37]. For example, the existence of specific objects, which is deduced 
via a series of related questions based on the values of certain (high-level) features, can lead to the 
recognition of a scene or of an object depicted in an image. 
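Classification with an already constructed tree can be sketched as follows. The nested-dictionary representation, the thresholds, and the class names are hypothetical and serve only to illustrate the sequence of "Yes"/"No" tests from the root to a leaf.

```python
# Hypothetical OBCT for a two-dimensional feature space; all values are made up.
tree = {
    "feature": 0, "threshold": 0.5,          # root: is x_1 < 0.5 ?
    "yes": {"label": "omega_1"},             # leaf node
    "no": {
        "feature": 1, "threshold": 0.3,      # is x_2 < 0.3 ?
        "yes": {"label": "omega_2"},
        "no": {"label": "omega_3"},
    },
}

def classify(node, x):
    """Follow the path of successive decisions until a leaf node is reached."""
    while "label" not in node:
        branch = "yes" if x[node["feature"]] < node["threshold"] else "no"
        node = node[branch]
    return node["label"]

print(classify(tree, [0.2, 0.9]))   # answered "Yes" at the root
print(classify(tree, [0.8, 0.7]))   # answered "No" twice
```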

Once a tree has been developed, classification is straightforward. The major challenge lies in con- 
structing the tree, by exploiting the information that resides in the training data set. The main questions 
one is confronted with while designing a tree are the following: 

• Which splitting criterion should be adopted? 

• When should one stop growing a tree and declare a node as final? 

• How is a leaf node associated with a specific class? 

Besides the above issues, there are more that will be discussed later on. 

Splitting criterion: We have already stated that the questions asked at each node are of the following 
type: 

Is $x_i < a$?

The goal is to select an appropriate value for the threshold value a. Assume that starting from the root 
node, the tree has grown up to the current node, $t$. Each node $t$ is associated with a subset $X_t \subseteq X$ of
the training data set X. This is the set of the training points that have survived to this node, after the 
tests that have taken place at the previous nodes in the tree. For example, in Fig. 7.13, a number of 



FIGURE 7.12 


Partition of the two-dimensional features space, corresponding to three classes, via a classification (OBCT) tree. 















FIGURE 7.13 

The classification tree that performs the space partitioning for the task indicated in Fig. 7.12. 


points, which belong to, say, class $\omega_1$, will not be involved in node $t_1$ because they have already been
assigned in a previously labeled leaf node. The purpose of a splitting criterion is to split $X_t$ into two
disjoint subsets, namely, $X_{tY}$ and $X_{tN}$, depending on the answer to the specific question at node $t$. For
every split, the following is true:

$$X_{tY} \cap X_{tN} = \emptyset,$$

$$X_{tY} \cup X_{tN} = X_t.$$

The goal in each node is to select which feature is to be tested and also what is the best value of the
corresponding threshold value $a$. The adopted philosophy is to make the choice so that every split
generates sets, $X_{tY}$, $X_{tN}$, which are more class-homogeneous compared to $X_t$. In other words, the data
in each one of the two descendant sets must show a higher preference to specific classes, compared to
the ancestor set. For example, assume that the data in $X_t$ consist of points that belong to four classes,
$\omega_1, \omega_2, \omega_3, \omega_4$. The idea is to perform the splitting so that most of the data in $X_{tY}$ belong to, say, $\omega_1, \omega_2$,
and most of the data in $X_{tN}$ to $\omega_3, \omega_4$. In the adopted terminology, the sets $X_{tY}$ and $X_{tN}$ should be
purer compared to $X_t$. Thus, we must first select a criterion that measures impurity and then compute
the threshold value and choose the specific feature (to be tested) to maximize the decrease in node
impurity. For example, a common measure to quantify impurity of node $t$ is the entropy, defined as

$$I(t) = -\sum_{m=1}^{M} P(\omega_m|t) \log_2 P(\omega_m|t), \tag{7.73}$$

where $\log_2(\cdot)$ is the base-two logarithm. The maximum value of $I(t)$ occurs if all probabilities are equal
(maximum impurity), and the smallest value, which is equal to zero, when only one of the probability
values is one and the rest equal zero. Probabilities are approximated as

$$P(\omega_m|t) = \frac{N_t^m}{N_t}, \quad m = 1, 2, \ldots, M,$$

where $N_t^m$ is the number of the points from class $m$ in $X_t$ and $N_t$ is the total number of points in $X_t$.
The decrease in node impurity, after splitting the data into two sets, is defined as

$$\Delta I(t) = I(t) - \frac{N_{tY}}{N_t} I(t_Y) - \frac{N_{tN}}{N_t} I(t_N), \tag{7.74}$$

where $I(t_Y)$ and $I(t_N)$ are the impurities associated with the two new sets, respectively, and $N_{tY}$, $N_{tN}$
are the numbers of points in $X_{tY}$ and $X_{tN}$. The goal now
becomes to select the specific feature $x_i$ and the threshold $a_t$ so that $\Delta I(t)$ becomes maximum. This
will now define two new descendant nodes of $t$, namely, $t_Y$ and $t_N$; thus, the tree grows with two new
nodes.
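As a small illustration of (7.73) and (7.74), the sketch below computes the entropy impurity from per-class point counts and the impurity decrease of a candidate split. The function names and the counts are hypothetical; the example mirrors the four-class situation mentioned earlier, where one descendant set receives the points of two classes and the other set the points of the remaining two.

```python
import math

def entropy_impurity(counts):
    """Entropy (7.73) of a node, given the number of points per class."""
    n_t = sum(counts)
    # Classes with zero points are skipped, i.e., 0 * log2(0) is taken as 0.
    return -sum((c / n_t) * math.log2(c / n_t) for c in counts if c > 0)

def impurity_decrease(counts_yes, counts_no):
    """Decrease in impurity (7.74) when a node is split into a Yes and a No set."""
    parent = [a + b for a, b in zip(counts_yes, counts_no)]
    n_yes, n_no = sum(counts_yes), sum(counts_no)
    n_t = n_yes + n_no
    return (entropy_impurity(parent)
            - n_yes / n_t * entropy_impurity(counts_yes)
            - n_no / n_t * entropy_impurity(counts_no))

# Four equiprobable classes (10 points each), separated two-by-two by the split.
print(entropy_impurity([10, 10, 10, 10]))                  # parent impurity
print(impurity_decrease([10, 10, 0, 0], [0, 0, 10, 10]))   # decrease after the split
```

In this example the parent node has an impurity of two bits and each descendant one bit, so the split gains one bit.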

A way to search for different threshold values is the following. For each one of the features
$x_i$, $i = 1, 2, \ldots, l$, rank the values $x_{in}$, $n = 1, 2, \ldots, N_t$, which this feature takes among the training
points in $X_t$. Then define a sequence of corresponding threshold values, $a_{in}$, to be halfway between
consecutive distinct values of $x_{in}$. Then test the impurity change that occurs for each one of these
threshold values and keep the one that achieves the maximum decrease. Repeat the process for all
features, and finally, keep the combination that results in the best maximum decrease.
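The search procedure just described can be sketched as follows, assuming the entropy of (7.73) as the impurity measure and a data matrix whose rows are the training points that have reached the current node. The function names and the toy data are illustrative.

```python
import numpy as np

def entropy(labels):
    """Entropy impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X, y, impurity=entropy):
    """Return (feature index, threshold, impurity decrease) of the best split."""
    n_t, l = X.shape
    parent = impurity(y)
    best = (None, None, 0.0)
    for i in range(l):
        values = np.unique(X[:, i])                      # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2.0    # halfway points
        for a in thresholds:
            yes = X[:, i] < a
            n_yes = yes.sum()
            n_no = n_t - n_yes
            decrease = (parent
                        - n_yes / n_t * impurity(y[yes])
                        - n_no / n_t * impurity(y[~yes]))
            if decrease > best[2]:
                best = (i, float(a), float(decrease))
    return best

# Toy node with four points: the first feature separates the two classes.
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 6.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))
```

Because the thresholds lie strictly between consecutive distinct values, both descendant sets are always nonempty.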

Besides entropy, other impurity measuring indices can be used. A popular alternative, which results
in a slightly sharper maximum compared to the entropy one, is the so-called Gini index, defined as

$$I(t) = \sum_{m=1}^{M} P(\omega_m|t) \big(1 - P(\omega_m|t)\big). \tag{7.75}$$

This index is also zero if one of the probability values is equal to 1 and the rest are zero, and it takes
its maximum value when all classes are equiprobable.
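A short sketch of the Gini index, computed from hypothetical per-class counts, illustrates the two extreme cases mentioned above. For $M$ equiprobable classes, the value works out to $1 - 1/M$.

```python
def gini_impurity(counts):
    """Gini index of a node, given the number of points per class."""
    n_t = sum(counts)
    return sum((c / n_t) * (1.0 - c / n_t) for c in counts)

print(gini_impurity([10, 0]))        # pure node
print(gini_impurity([5, 5]))         # two equiprobable classes
print(gini_impurity([2, 2, 2, 2]))   # four equiprobable classes
```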

Stop-splitting rule: The obvious question when growing a tree is when to stop growing it. One possible
way is to adopt a threshold value $T$, and stop splitting a node once the maximum value $\Delta I(t)$, for all
possible splits, is smaller than $T$. Another possibility is to stop when the cardinality of $X_t$ becomes
smaller than a certain number or if the node is pure, in the sense that all points in it belong to a single
class.

Class assignment rule: Once a node $t$ is declared to be a leaf node, it is assigned a class label; usually
this is done on a majority voting rationale. That is, it is assigned the label of the class where most of
the data in $X_t$ belong.
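The splitting criterion, the stop-splitting rule, and the class assignment rule can be put together in a compact recursive sketch. It is only an illustration under stated assumptions: the Gini index is used as the impurity measure, the threshold `T` and the minimum cardinality `min_points` are hypothetical parameter names, and the grid data are made up.

```python
import numpy as np
from collections import Counter

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1.0 - p)).sum())

def best_split(X, y):
    """Best (feature, threshold, impurity decrease) over halfway thresholds."""
    best = (None, None, 0.0)
    for i in range(X.shape[1]):
        v = np.unique(X[:, i])
        for a in (v[:-1] + v[1:]) / 2.0:
            yes = X[:, i] < a
            dec = gini(y) - yes.mean() * gini(y[yes]) - (~yes).mean() * gini(y[~yes])
            if dec > best[2]:
                best = (i, float(a), float(dec))
    return best

def grow(X, y, T=1e-3, min_points=2):
    """Grow a tree; leaves are labeled by majority voting."""
    majority = Counter(y.tolist()).most_common(1)[0][0]
    if len(y) < min_points or gini(y) == 0.0:     # too few points, or pure node
        return {"label": majority}
    i, a, dec = best_split(X, y)
    if i is None or dec < T:                      # best decrease is below T
        return {"label": majority}
    yes = X[:, i] < a
    return {"feature": i, "threshold": a,
            "yes": grow(X[yes], y[yes], T, min_points),
            "no": grow(X[~yes], y[~yes], T, min_points)}

def predict(node, x):
    while "label" not in node:
        node = node["yes"] if x[node["feature"]] < node["threshold"] else node["no"]
    return node["label"]

# Made-up three-class problem on a regular grid in the unit square.
g = np.linspace(0.05, 0.95, 10)
X = np.array([[a, b] for a in g for b in g])
y = np.where(X[:, 0] < 0.5, 0, np.where(X[:, 1] < 0.5, 1, 2))
tree = grow(X, y)
print(tree)
```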

Pruning a tree: Experience has shown that growing a tree and using a stopping rule does not always
work well in practice; growing may either stop early or may result in trees of very large size. A common
practice is to first grow a tree up to a large size and then adopt a pruning technique to eliminate nodes.
Different pruning criteria can be used; a popular one is to combine an estimate of the error probability
with a complexity measuring index (see [2,31]).

Remarks 7.6. 


• Among the notable advantages of decision trees is the fact that they can naturally treat mixtures of
numeric and categorical variables. Moreover, they scale well with large data sets. Also, they can
treat missing variables in an effective way. In many domains, not all the values of the features are
known for every pattern. The values may have gone unrecorded, or they may be too expensive to 
obtain. Finally, due to their structural simplicity, they are easily interpretable; in other words, it is 
possible for a human to understand the reason for the output of the learning algorithm. In some 
applications, such as in financial decisions, this is a legal requirement. 

On the other hand, the prediction performance of the tree classifiers is not as good as that of other
methods, such as support vector machines and neural networks, to be treated in Chapters 11 and 18,
respectively.

• A major drawback associated with the tree classifiers is that they are unstable. That is, a small
change in the training data set can result in a very different tree. The reason for this lies in the
hierarchical nature of the tree classifiers. An error that occurs in a node at a high level of the tree
propagates all the way down to the leaves below it.

Bagging (bootstrap aggregating) [3] is a technique that can reduce the variance and improve the
generalization error performance. The basic idea is to create $B$ variants, $X_1, X_2, \ldots, X_B$, of the
training set $X$, using bootstrap techniques, by uniformly sampling from $X$ with replacement. For
each of the training set variants $X_i$, a tree $T_i$ is constructed. The final decision for the classification
of a given point is in favor of the class predicted by the majority of the subclassifiers $T_i$, $i = 1, 2, \ldots, B$.

Random forests use the idea of bagging in tandem with random feature selection [5]. The difference
with bagging lies in the way the decision trees are constructed. The feature to split in each
node is selected as the best among a set of $F$ randomly chosen features, where $F$ is a user-defined
parameter. This extra introduced randomness is reported to have a substantial effect in performance
improvement.

Random forests often have very good predictive accuracy and have been used in a number of 
applications, including body pose recognition via Microsoft’s popular Kinect sensor [34]. 

Besides the previous methods, more recently, Bayesian techniques have also been suggested and 
used to stabilize the performance of trees (see [8,44]). Of course, the effect of using multiple trees 
is losing a main advantage of the trees, that is, their fairly easy interpretability. 

• Besides the OBCT rationale, a more general partition of the feature space has also been proposed
via hyperplanes that are not parallel to the axes. This is possible via questions of the following
type: Is $\sum_{i=1}^{l} c_i x_i < a$? This can lead to a better partition of the space. However, the training now
becomes more involved (see [35]).

• Decision trees have also been proposed for regression tasks, albeit with less success. The idea is to 
split the space into regions, and prediction is performed based on the average of the output values in 
the region where the observed input vector lies; such an averaging approach has as a consequence 
the lack of smoothness as one moves from one region to another, which is a major drawback of 
regression trees. The splitting into regions is performed based on the LS method [19]. 
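The bagging scheme mentioned in the remarks above can be sketched as follows. To keep the example short, the base learner is a single-node tree (a stump) rather than a fully grown tree, and the labels are assumed to lie in $\{-1, +1\}$; both are simplifying assumptions, and all function names are hypothetical. A random forest would additionally restrict the search in each node to $F$ randomly chosen features, which is not shown here.

```python
import numpy as np

def fit_stump(X, y):
    """Hypothetical base learner: best single-feature threshold test."""
    best = (np.inf, 0, 0.0, 1)
    for i in range(X.shape[1]):
        v = np.unique(X[:, i])
        for a in (v[:-1] + v[1:]) / 2.0:
            for s in (1, -1):
                err = np.mean(np.where(X[:, i] < a, s, -s) != y)
                if err < best[0]:
                    best = (err, i, a, s)
    _, i, a, s = best
    return lambda Z: np.where(Z[:, i] < a, s, -s)

def bagging(X, y, fit, B=11, seed=0):
    """Train B base learners on bootstrap variants and combine by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # uniform sampling with replacement
        models.append(fit(X[idx], y[idx]))
    def predict(Z):
        votes = np.sum([m(Z) for m in models], axis=0)
        return np.where(votes >= 0, 1, -1)    # an odd B avoids ties
    return predict

# Made-up data: the label depends on the first feature only.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.4, 1, -1)
predict = bagging(X, y, fit_stump)
print("training accuracy:", np.mean(predict(X) == y))
```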


7.9 COMBINING CLASSIFIERS 

So far, we have discussed a number of classifiers, and more methods will be presented in Chapters 11,
13, and 18, concerning support vector machines, Bayesian methods, and neural/deep networks, respectively. The
obvious question an inexperienced practitioner/researcher is confronted with is, which method then?
Unfortunately, there is no definitive answer. Furthermore, the choice of a method becomes a harder
problem when the data size is rather small. The goal of this section is to discuss techniques that can
benefit by combining different learners together.

NO FREE LUNCH THEOREM 

The goal of the design of any classifier, and in general of any learning scheme, based on a training set
of a finite size, is to provide a good generalization performance. However, there are no context-independent
or usage-independent reasons to support one learning technique over another. Each learning
task, represented by the available data set, will show a preference for a specific learning scheme that
fits the specificities of the particular problem at hand. An algorithm that scores top in one problem can
score low for another. This experimental finding is theoretically substantiated by the so-called no free
lunch theorem for machine learning [43].

This important theorem states that, averaged over all possible data generating distributions, every
classification algorithm results in the same error rate on data outside the training set. In other words,
there is no learning algorithm that is universally optimal. However, note that these results hold only
when one averages over all possible data generating distributions. If, on the other hand, when designing
a learner, we exploit prior knowledge concerning the specificities of the particular data set, which is of
interest to us, then we can design an algorithm that performs well on this data set.

In practice, one should try different learning methods from the available palette, each optimized
to the specific task, and test its generalization performance against an independent data set different
from the one used for training, using, for example, the leave-one-out method or any of its variants
(Chapter 3). Then keep and use the method that scored best for the specific task.

To this end, there are a number of major efforts to compare different classifiers against different
data sets and measure the “average” performance, via the use of different statistical indices, in order to
quantify the overall performance of each classifier against the data sets.

SOME EXPERIMENTAL COMPARISONS 

Experimental comparisons of methods always have a strong historical flavor, since, as time passes,
new methods come into existence and, moreover, new and larger data sets can be obtained, which may
change conclusions. In this subsection, we present samples of some major big projects that have been
completed, whose goal was to compare different learners together. The terrain today may be different
due to the advent of deep neural networks; yet, the knowledge that can be extracted from these previous
projects is still useful and enlightening.

One of the very first efforts to compare the performance of different classifiers was the Statlog
project [27]. Two subsequent efforts are summarized in [7,26]. In the former, 17 popular classifiers
were tested against 21 data sets. In the latter, 10 classifiers and 11 data sets were employed. The results
verify what has already been said: different classifiers perform better for different sets. However, it is
reported that boosted trees (Section 7.11), random forests, bagged decision trees, and support vector
machines were ranked among the top ones for most of the data sets.

The Neural Information Processing Systems Workshop (NIPS-2003) organized a classification
competition based on five data sets. The results of the competition are summarized in [18]. The
competition was focused on feature selection [28]. In a follow-up study [22], more classifiers were added.
Among the considered classifiers, a Bayesian-type neural network scheme (Chapter 18) scored at the
top, albeit at significantly higher run time requirements. The other classifiers considered were random
forests and boosting, where trees and neural networks were used as base classifiers (Section 7.10). Random
forests also performed well, at much lower computational times compared to the Bayesian-type
classifier.

SCHEMES FOR COMBINING CLASSIFIERS 

A trend to improve performance is to combine different classifiers and exploit their individual advan- 
tages. An observation that justifies such an approach is that during testing, there are patterns on which 
even the best classifier for a particular task fails to predict their true class. In contrast, the same patterns 
can be classified correctly by other classifiers, with an inferior overall performance. This shows that 
there may be some complementarity among different classifiers, and combination can lead to boosted 
performance compared to that obtained from the best (single) classifier. Recall that bagging, mentioned 
in Section 7.8, is a type of classifier combination. 

The issue that arises now is to select a combination scheme. There are different schemes, and the 
results they provide can be different. Below, we summarize the more popular combination schemes. 

• Arithmetic averaging rule: Assume that we use $L$ classifiers, where each one outputs a value
of the posterior probability $P_j(\omega_i|x)$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, L$. A decision concerning the
class assignment is based on the following rule:

$$\text{Assign } x \text{ to class } \omega_i: \quad i = \arg\max_k \frac{1}{L} \sum_{j=1}^{L} P_j(\omega_k|x), \quad k = 1, 2, \ldots, M. \tag{7.76}$$

It can be shown that this rule is equivalent to computing the “final” posterior probability, $P(\omega_i|x)$,
by minimizing the average Kullback-Leibler divergence (Problem 7.16),

$$D_{av} = \frac{1}{L} \sum_{j=1}^{L} D_j,$$

where

$$D_j = \sum_{i=1}^{M} P_j(\omega_i|x) \ln \frac{P_j(\omega_i|x)}{P(\omega_i|x)}.$$


• Geometric averaging rule: This rule is the outcome of minimizing the alternative formulation of the
Kullback-Leibler divergence (note that KL divergence is not symmetric); in other words,

$$D_j = \sum_{i=1}^{M} P(\omega_i|x) \ln \frac{P(\omega_i|x)}{P_j(\omega_i|x)},$$

which results in (Problem 7.17)

$$\text{Assign } x \text{ to class } \omega_i: \quad i = \arg\max_k \prod_{j=1}^{L} P_j(\omega_k|x), \quad k = 1, 2, \ldots, M. \tag{7.77}$$


• Stacking: An alternative way is to use a weighted average of the outputs of the individual classifiers,
where the combination weights are obtained optimally using the training data. Assume that the
output of each individual classifier, $f_j(x)$, is of a soft type, for example, an estimate of the posterior
probability, as before. Then the combined output is given by

$$f(x) = \sum_{j=1}^{L} w_j f_j(x), \tag{7.78}$$

where the weights are estimated via the following optimization task:

$$\hat{w} = \arg\min_{w} \sum_{n=1}^{N} \mathcal{L}\big(y_n, f(x_n)\big) = \arg\min_{w} \sum_{n=1}^{N} \mathcal{L}\Big(y_n, \sum_{j=1}^{L} w_j f_j(x_n)\Big), \tag{7.79}$$

where $\mathcal{L}(\cdot, \cdot)$ is a loss function, for example, the squared error one. However, adopting the previous
optimization, based on the training data set, can lead to overfitting. According to stacking [42],
a cross-validation rationale is adopted and instead of $f_j(x_n)$, we employ the $f_j^{(-n)}(x_n)$, where the
latter is the output of the $j$th classifier trained on the data after excluding the pair $(y_n, x_n)$. In other
words, the weights are estimated by

$$\hat{w} = \arg\min_{w} \sum_{n=1}^{N} \mathcal{L}\Big(y_n, \sum_{j=1}^{L} w_j f_j^{(-n)}(x_n)\Big). \tag{7.80}$$

Sometimes, the weights are constrained to be positive and add to one, giving rise to a constrained
optimization task.

• Majority voting rule: The previous methods belong to the family of soft-type rules. A popular
alternative is a hard-type rule, which is based on a voting scheme. One decides in favor of the class for
which either there is a consensus or at least $l_c$ of the classifiers agree on the class label, where

$$l_c = \begin{cases} \dfrac{L}{2} + 1, & L \text{ is even}, \\[2mm] \dfrac{L+1}{2}, & L \text{ is odd}. \end{cases}$$

Otherwise, the decision is rejection (i.e., no decision is taken).
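The arithmetic averaging, geometric averaging, and majority voting rules listed above can be sketched as follows. The matrix `P` is assumed to hold the posterior estimates of the $L$ classifiers row-wise; the numerical values are made up and were chosen so that the soft and the hard rules disagree.

```python
import numpy as np

def arithmetic_rule(P):
    """P has shape (L, M); row j holds P_j(omega_k | x). Implements (7.76)."""
    return int(np.argmax(P.mean(axis=0)))

def geometric_rule(P):
    """Product rule (7.77); logarithms are summed for numerical stability."""
    return int(np.argmax(np.log(P + 1e-300).sum(axis=0)))

def majority_vote(P):
    """Hard rule: return the winning class index, or None for rejection."""
    L = P.shape[0]
    l_c = L // 2 + 1 if L % 2 == 0 else (L + 1) // 2
    votes = np.bincount(P.argmax(axis=1), minlength=P.shape[1])
    k = int(votes.argmax())
    return k if votes[k] >= l_c else None

# Three classifiers, three classes (class indices start at 0).
P = np.array([[0.60, 0.30, 0.10],
              [0.50, 0.40, 0.10],
              [0.05, 0.90, 0.05]])
print(arithmetic_rule(P), geometric_rule(P), majority_vote(P))

# Four classifiers split two against two: fewer than l_c = 3 agree.
P_tie = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
print(majority_vote(P_tie))
```

In the first example, two of the three classifiers vote for the first class, but the third one is so confident about the second class that both soft rules select the latter.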

In addition to the sum, product, and majority voting, other combination rules have also been suggested,
which are inspired by the inequalities [24]

$$\prod_{j=1}^{L} P_j(\omega_i|x) \le \min_{j=1,\ldots,L} P_j(\omega_i|x) \le \frac{1}{L} \sum_{j=1}^{L} P_j(\omega_i|x) \le \max_{j=1,\ldots,L} P_j(\omega_i|x), \tag{7.81}$$

and classification is achieved by using the max or min bounds instead of the sum and product. When
outliers are present, one can instead use the median value, i.e.,

$$\text{Assign } x \text{ to class } \omega_i: \quad i = \arg\max_k \; \underset{j}{\mathrm{median}} \left\{ P_j(\omega_k|x) \right\}, \quad k = 1, 2, \ldots, M. \tag{7.82}$$
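The chain of inequalities in (7.81) and the median rule (7.82) can be illustrated numerically. In the sketch below, the randomly drawn posteriors and the outlier example are made up; the last classifier of the second example plays the role of an outlier that flips the arithmetic average but not the median.

```python
import numpy as np

# Check (7.81) on randomly drawn posteriors: row j holds P_j(omega_i | x).
rng = np.random.default_rng(0)
L, M = 5, 3
P = rng.dirichlet(np.ones(M), size=L)
prod, mn = P.prod(axis=0), P.min(axis=0)
avg, mx = P.mean(axis=0), P.max(axis=0)
chain_holds = bool(np.all(prod <= mn) and np.all(mn <= avg) and np.all(avg <= mx))
print("inequalities (7.81) hold:", chain_holds)

def median_rule(P):
    """Median rule (7.82): class index with the largest median posterior."""
    return int(np.argmax(np.median(P, axis=0)))

# Three classifiers favor the first class; the fourth is an outlier.
P_out = np.array([[0.70, 0.30],
                  [0.60, 0.40],
                  [0.65, 0.35],
                  [0.01, 0.99]])
print("arithmetic rule:", int(np.argmax(P_out.mean(axis=0))))
print("median rule:", median_rule(P_out))
```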


It turns out that the no free lunch theorem is also valid for the combination rules; there is not a universally
optimal rule. It all depends on the data at hand (see [21]).
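The stacking rule in (7.80) can be sketched for the squared error loss, in which case the unconstrained weights follow from an ordinary least-squares problem. The base learners used here (least-squares linear models, each seeing a different subset of the features), the function names, and the data are hypothetical; the constrained variant with positive weights that add to one is not shown.

```python
import numpy as np

def fit_ls(X, y):
    """Hypothetical base learner: least-squares linear model with a bias term."""
    Xb = np.c_[X, np.ones(len(X))]
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.c_[Z, np.ones(len(Z))] @ theta

def stacking_weights(X, y, feature_sets):
    """Weights of (7.80) for the squared error loss; learner j sees feature_sets[j]."""
    N, L = len(y), len(feature_sets)
    F = np.empty((N, L))                       # F[n, j] = f_j^{(-n)}(x_n)
    for n in range(N):
        keep = np.arange(N) != n               # exclude the pair (y_n, x_n)
        for j, cols in enumerate(feature_sets):
            f = fit_ls(X[keep][:, cols], y[keep])
            F[n, j] = f(X[n:n + 1][:, cols])[0]
    w, *_ = np.linalg.lstsq(F, y, rcond=None)  # unconstrained least squares
    return w, F

# Made-up data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
w, F = stacking_weights(X, y, feature_sets=[[0], [1], [0, 1, 2]])
print("stacking weights:", w)
```

Since each single learner corresponds to a particular choice of the weight vector, the leave-one-out squared error of the stacked combination cannot exceed that of the best individual learner.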

There are a number of other issues related to the theory of combining classifiers; for example, how 
does one choose the classifiers to be combined? Should the classifiers be dependent or independent? 
Furthermore, combination does not necessarily imply improved performance; in some cases, one may 
experience a performance loss (higher error rate) compared to that of the best (single) classifier [20, 
21]. Thus, combining has to take place with care. More on these issues can be found in [25,38] and the 
references therein. 


7.10 THE BOOSTING APPROACH 

The origins of the boosting method for designing learning machines are traced back to the work of 
Valiant and Kearns [23,40], who posed the question of whether a weak learning algorithm, meaning 
one that does slightly better than random guessing, can be boosted into a strong one with a good performance
index. At the heart of such techniques lies the base learner, which is a weak one. Boosting
consists of an iterative scheme, where at each step the base learner is optimally computed using a 
different training set; the set at the current iteration is generated either according to an iteratively ob- 
tained data distribution or, usually, via a weighting of the training samples, each time using a different 
set of weights. The latter are computed in order to take into account the achieved performance up to 
the current iteration step. The final learner is obtained via a weighted average of all the hierarchi- 
cally designed base learners. Thus, boosting can also be considered a scheme for combining different 
learners. 

It turns out that, given a sufficient number of iterations, one can significantly improve the (poor) 
performance of the weak learner. For example, in some cases in classification, the training error may 
tend to zero as the number of iterations increases. This is very interesting indeed. By training a weak
classifier with an appropriate manipulation of the training data (as a matter of fact, the weighting mechanism
identifies hard samples, the ones that keep failing, and places more emphasis on them), one can obtain
a strong classifier. Of course, as we will discuss, the fact that the training error may tend to zero does
not necessarily mean the test error goes to zero as well. 

THE ADABOOST ALGORITHM

We now focus on the two-class classification task and assume that we are given a set of $N$ training
observations, $(y_n, x_n)$, $n = 1, 2, \ldots, N$, with $y_n \in \{-1, 1\}$. Our goal is to design a binary classifier,

$$f(x) = \mathrm{sgn}\{F(x)\}, \tag{7.83}$$

where

$$F(x) := \sum_{k=1}^{K} a_k \phi(x; \theta_k), \tag{7.84}$$


where $\phi(x; \theta_k) \in \{-1, 1\}$ is the base classifier at iteration $k$, defined in terms of a set of parameters,
$\theta_k$, $k = 1, 2, \ldots, K$, to be estimated. The base classifier is selected to be a binary one. The set of
unknown parameters is obtained in a step-wise approach and in a greedy way; that is, at each iteration
step $i$, we only optimize with respect to a single pair, $(a_i, \theta_i)$, by keeping the parameters $a_k, \theta_k$, $k = 1, 2, \ldots, i - 1$,
obtained from the previous steps, fixed. Note that ideally, one should optimize with
respect to all the unknown parameters, $a_k, \theta_k$, $k = 1, 2, \ldots, K$, simultaneously; however, this would
lead to a very computationally demanding optimization task. Greedy algorithms are very popular, due
to their computational simplicity, and lead to a very good performance in a wide range of learning
tasks. Greedy algorithms will also be discussed in the context of sparsity-aware learning in Chapter 10.

Assume that we are currently at the $i$th iteration step; consider the partial sum of terms

$$F_i(\cdot) = \sum_{k=1}^{i} a_k \phi(\cdot; \theta_k). \tag{7.85}$$

Then we can write the following recursion:

$$F_i(\cdot) = F_{i-1}(\cdot) + a_i \phi(\cdot; \theta_i), \quad i = 1, 2, \ldots, K, \tag{7.86}$$


starting from an initial condition. According to the greedy rationale, $F_{i-1}(\cdot)$ is assumed to be known
and the goal is to optimize with respect to the set of parameters $a_i, \theta_i$. For optimization, a loss function
has to be adopted. No doubt different options are available, giving different names to the derived
algorithm. A popular loss function used for classification is the exponential loss, defined as

$$\mathcal{L}(y, F(x)) = \exp\big(-yF(x)\big): \quad \text{exponential loss function}, \tag{7.87}$$


and it gives rise to the adaptive boosting (AdaBoost) algorithm. The exponential loss function is shown
in Fig. 7.14, together with the 0-1 loss function. The former can be considered a (differentiable) upper
bound of the (nondifferentiable) 0-1 loss function. Note that the exponential loss weighs misclassified
($yF(x) < 0$) points more heavily compared to the correctly identified ones ($yF(x) > 0$). Employing
the exponential loss function, the set $a_i, \theta_i$ is obtained via the respective empirical cost function, in the
following manner:

$$(a_i, \theta_i) = \arg\min_{a, \theta} \sum_{n=1}^{N} \exp\Big(-y_n \big(F_{i-1}(x_n) + a \phi(x_n; \theta)\big)\Big). \tag{7.88}$$


This optimization is also performed in two steps. First, $a$ is treated fixed and we optimize with respect
to $\theta$,

$$\theta_i = \arg\min_{\theta} \sum_{n=1}^{N} w_n^{(i)} \exp\big(-y_n a \phi(x_n; \theta)\big), \tag{7.89}$$

where

$$w_n^{(i)} := \exp\big(-y_n F_{i-1}(x_n)\big), \quad n = 1, 2, \ldots, N. \tag{7.90}$$

Observe that $w_n^{(i)}$ depends neither on $a$ nor on $\phi(x_n; \theta)$; hence it can be considered a weight
associated with sample $n$. Moreover, its value depends entirely on the results obtained from the previous
recursions.

We now turn our focus to the cost in (7.89). The optimization depends on the specific form of the
base classifier. Note, however, that the loss function is of an exponential form. Furthermore, the base
classifier is a binary one, so that $\phi(x; \theta) \in \{-1, 1\}$. If we assume that $a > 0$ (we will come back to it
soon), optimization of (7.89) is readily seen to be equivalent to optimizing the following cost:

$$\theta_i = \arg\min_{\theta} P_i, \tag{7.91}$$

where

$$P_i := \sum_{n=1}^{N} w_n^{(i)} \chi_{(-\infty,0]}\big(y_n \phi(x_n; \theta)\big), \tag{7.92}$$

and $\chi_{(-\infty,0]}(\cdot)$ is the 0-1 loss function.$^5$ In other words, only misclassified points (i.e., those for which
$y_n \phi(x_n; \theta) < 0$) contribute. Note that $P_i$ is the weighted empirical classification error. Obviously, when
the misclassification error is minimized, the cost in (7.89) is also minimized, because the exponential
loss weighs the misclassified points more heavily. To guarantee that $P_i$ remains in the $[0, 1]$ interval, the
weights are normalized to unity by dividing by the respective sum; note that this does not affect the
optimization process. In other words, $\theta_i$ can be computed in order to minimize the empirical
misclassification error committed by the base classifier. For base classifiers of very simple structure, such a
minimization is computationally feasible.

FIGURE 7.14

The 0-1, exponential, log-loss, and squared error loss functions. They have all been normalized to cross the point
(0, 1). The horizontal axis for the squared error corresponds to $y - F(x)$.

$^5$ The characteristic function $\chi_A(x)$ is equal to one if $x \in A$ and zero otherwise.

Having computed the optimal $\theta_i$, the following are easily established from the respective definitions,

$$\sum_{y_n \phi(x_n; \theta_i) < 0} w_n^{(i)} = P_i \tag{7.93}$$

and

$$\sum_{y_n \phi(x_n; \theta_i) > 0} w_n^{(i)} = 1 - P_i. \tag{7.94}$$

Combining (7.93) and (7.94) with (7.88) and (7.90), it is readily shown that

$$a_i = \arg\min_{a} \big\{ \exp(-a)(1 - P_i) + \exp(a) P_i \big\}. \tag{7.95}$$


Taking the derivative with respect to $a$ and equating to zero results in

$$a_i = \frac{1}{2} \ln \frac{1 - P_i}{P_i}. \tag{7.96}$$
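The step from (7.95) to (7.96) can be spelled out as follows; this is a short worked derivation added for convenience, using only the quantities defined above. It also shows that the minimum value of the cost in (7.95) is $2\sqrt{P_i(1 - P_i)}$.

```latex
\begin{align*}
J(a) &:= \exp(-a)\,(1 - P_i) + \exp(a)\,P_i, \\
\frac{dJ}{da} &= -\exp(-a)\,(1 - P_i) + \exp(a)\,P_i = 0
  \;\Longrightarrow\; \exp(2a) = \frac{1 - P_i}{P_i}
  \;\Longrightarrow\; a_i = \frac{1}{2}\ln\frac{1 - P_i}{P_i}, \\
\frac{d^2J}{da^2} &= \exp(-a)\,(1 - P_i) + \exp(a)\,P_i > 0
  \quad \text{(so the stationary point is a minimum)}, \\
J(a_i) &= \sqrt{\frac{P_i}{1 - P_i}}\,(1 - P_i) + \sqrt{\frac{1 - P_i}{P_i}}\,P_i
  = 2\sqrt{P_i\,(1 - P_i)}.
\end{align*}
```

The minimum value is smaller than one whenever $P_i \neq 0.5$, that is, whenever the base classifier does better than random guessing on the weighted data.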

Note that if $P_i < 0.5$, then $a_i > 0$, which is expected to be the case in practice. Once $a_i$ and $\theta_i$ have
been estimated, the weights for the next iteration are readily given by

$$w_n^{(i+1)} = \frac{\exp\big(-y_n F_i(x_n)\big)}{Z_i} = \frac{w_n^{(i)} \exp\big(-y_n a_i \phi(x_n; \theta_i)\big)}{Z_i}, \tag{7.97}$$

where $Z_i$ is the normalizing factor

$$Z_i := \sum_{n=1}^{N} w_n^{(i)} \exp\big(-y_n a_i \phi(x_n; \theta_i)\big). \tag{7.98}$$

Looking at the way the weights are formed, one can grasp one of the major secrets underlying the
AdaBoost algorithm: The weight associated with a training sample $x_n$ is increased (decreased) with
respect to its value at the previous iteration, depending on whether the pattern has failed (succeeded)
in being classified correctly. Moreover, the percentage of the decrease (increase) depends on the value
of $a_i$, which controls the relative importance in the buildup of the final classifier. Hard samples, which
keep failing over successive iterations, gain importance in their participation in the weighted empirical
error value. For the case of the AdaBoost, it can be shown that the training error tends to zero
exponentially fast (Problem 7.18). The scheme is summarized in Algorithm 7.1.

Algorithm 7.1 (The AdaBoost algorithm).

• Initialize: $w_n^{(1)} = \frac{1}{N}$, $n = 1, 2, \ldots, N$

• Initialize: $i = 1$

• Repeat

  - Compute the optimum $\theta_i$ in $\phi(\cdot; \theta_i)$ by minimizing $P_i$; (7.91)

  - Compute the optimum $P_i$; (7.92)

  - $a_i = \frac{1}{2} \ln \frac{1 - P_i}{P_i}$

  - $Z_i = 0$

  - For $n = 1$ to $N$ Do

    * $w_n^{(i+1)} = w_n^{(i)} \exp\big(-y_n a_i \phi(x_n; \theta_i)\big)$

    * $Z_i = Z_i + w_n^{(i+1)}$

  - End For

  - For $n = 1$ to $N$ Do

    * $w_n^{(i+1)} = w_n^{(i+1)} / Z_i$

  - End For

  - $K = i$

  - $i = i + 1$

• Until a termination criterion is met.

• $f(\cdot) = \mathrm{sgn}\Big(\sum_{k=1}^{K} a_k \phi(\cdot; \theta_k)\Big)$
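The steps of Algorithm 7.1 can be sketched in code as follows. The base classifier is assumed to be a decision stump (a single threshold test on one feature), and a fixed number of iterations `K` serves as the termination criterion; both are illustrative choices, as are the function names and the data. One threshold below the smallest feature value is added to the halfway thresholds so that a constant stump is also available.

```python
import numpy as np

def fit_stump(X, y, w):
    """Stump phi(x) = s if x_i < a else -s, minimizing the weighted error (7.92)."""
    best = (np.inf, 0, 0.0, 1)
    for i in range(X.shape[1]):
        v = np.unique(X[:, i])
        thresholds = np.concatenate(([v[0] - 1.0], (v[:-1] + v[1:]) / 2.0))
        for a in thresholds:
            pred = np.where(X[:, i] < a, 1, -1)
            for s in (1, -1):
                P = w[s * pred != y].sum()            # weighted error P_i
                if P < best[0]:
                    best = (P, i, a, s)
    return best

def stump_predict(X, i, a, s):
    return s * np.where(X[:, i] < a, 1, -1)

def adaboost(X, y, K=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                           # w_n^(1) = 1/N
    ensemble = []
    for _ in range(K):
        P, i, a, s = fit_stump(X, y, w)
        P = min(max(P, 1e-12), 1.0 - 1e-12)           # guard against log(0)
        alpha = 0.5 * np.log((1.0 - P) / P)           # coefficient a_i, (7.96)
        w = w * np.exp(-y * alpha * stump_predict(X, i, a, s))
        w /= w.sum()                                  # division by Z_i, (7.97)
        ensemble.append((alpha, i, a, s))
    return ensemble

def predict(ensemble, X):
    F = sum(alpha * stump_predict(X, i, a, s) for alpha, i, a, s in ensemble)
    return np.where(F >= 0, 1, -1)

# Made-up one-dimensional task that no single stump can solve: +1 inside an interval.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.7), 1, -1)
for K in (1, 10):
    acc = np.mean(predict(adaboost(X, y, K), X) == y)
    print(f"K = {K}: training accuracy = {acc:.3f}")
```

On this toy task, a single stump misclassifies a sizable fraction of the points, whereas the weighted combination of a few stumps can represent the interval.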


The AdaBoost was first derived in [14] in a different way. Our formulation follows that given in
[15]. Yoav Freund and Robert Schapire received the prestigious Gödel Prize for this algorithm in 2003.


THE LOG-LOSS FUNCTION

In AdaBoost, the exponential loss function was employed. From a theoretical point of view, this can
be justified by the following argument. Consider the mean value with respect to the binary label, $y$, of
the exponential loss function

$$E\big[\exp(-yF(x))\big] = P(y = 1|x) \exp\big(-F(x)\big) + P(y = -1|x) \exp\big(F(x)\big). \tag{7.99}$$

Taking the derivative with respect to $F(x)$ and equating to zero, we readily obtain that the minimum
of (7.99) occurs at

$$F_*(x) = \arg\min_{f} E\big[\exp(-yf)\big] = \frac{1}{2} \ln \frac{P(y = 1|x)}{P(y = -1|x)}. \tag{7.100}$$

The logarithm of the ratio on the right-hand side is known as the log-odds ratio. Hence, if one views 
the minimizing function in (7.88) as the empirical approximation of the mean value in (7.99), it fully 
justifies considering the sign of the function in (7.83) as the classification rule. 
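The minimization that leads to (7.100) can be written out explicitly. The following worked steps use the shorthand $p := P(y = 1|x)$, introduced here only for compactness.

```latex
\begin{align*}
J(F) &:= p\,\exp(-F) + (1 - p)\,\exp(F), \\
\frac{dJ}{dF} &= -p\,\exp(-F) + (1 - p)\,\exp(F) = 0
  \;\Longrightarrow\; \exp(2F) = \frac{p}{1 - p}
  \;\Longrightarrow\; F_*(x) = \frac{1}{2}\ln\frac{p}{1 - p}, \\
F_*(x) > 0 &\iff p > 1 - p \iff P(y = 1|x) > P(y = -1|x).
\end{align*}
```

The last line shows that taking the sign of $F_*(x)$ coincides with deciding in favor of the class with the larger posterior probability.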

A major problem associated with the exponential loss function, as is readily seen in Fig. 7.14, is
that it heavily weights wrongly classified samples, depending on the value of the respective margin,
defined as

$$m_x := |yF(x)|. \tag{7.101}$$

Note that the farther the point is from the decision surface ($F(x) = 0$), the larger the value of $|F(x)|$.
Thus, points that are located at the wrong side of the decision surface ($yF(x) < 0$) and far away are
weighted with (exponentially) large values, and their influence in the optimization process is large
compared to the other points. Thus, in the presence of outliers, the exponential loss is not the most
appropriate one. As a matter of fact, in such environments, the performance of the AdaBoost can
degrade dramatically.

An alternative loss function is the log-loss or binomial deviance, defined as

$$\mathcal{L}(y, F(x)) := \ln\left(1 + \exp(-yF(x))\right) : \text{ log-loss function}, \tag{7.102}$$


which is also shown in Fig. 7.14. Observe that its increase is almost linear for large negative values. 
Such a function leads to a more balanced influence of the loss among all the points. We will return 
to the issue of robust loss functions, that is, loss functions that are more immune to the presence of 
outliers, in Chapter 11. Note that the function that minimizes the mean of the log-loss, with respect
to $y$, is the log-odds ratio itself, that is, the function given in (7.100) without the factor 1/2, so its
sign leads to the same classification rule (try it). However, if one employs the log-loss instead of
the exponential, the optimization task gets more involved, and one has to resort to gradient descent or
Newton-type schemes for optimization (see [16]).
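The two minimizers discussed above can be checked numerically. The following is a minimal sketch, assuming a fixed posterior value p = P(y = 1 | x) = 0.8 and a simple grid search over candidate values of F(x); the function names, the grid, and the value of p are illustrative choices and are not part of the original text. It also prints the penalty that each loss assigns to a point with a large negative value of yF(x), which illustrates the exponential versus almost linear growth.

```python
import numpy as np

def expected_exp_loss(F, p):
    """Mean of exp(-y F) over y in {+1, -1}, with P(y = +1 | x) = p, as in (7.99)."""
    return p * np.exp(-F) + (1.0 - p) * np.exp(F)

def expected_log_loss(F, p):
    """Mean of ln(1 + exp(-y F)) over y in {+1, -1}, with P(y = +1 | x) = p."""
    return p * np.log1p(np.exp(-F)) + (1.0 - p) * np.log1p(np.exp(F))

p = 0.8                                  # assumed posterior P(y = +1 | x)
grid = np.linspace(-5.0, 5.0, 200001)    # candidate values of F(x), step 5e-5

F_exp = grid[np.argmin(expected_exp_loss(grid, p))]
F_log = grid[np.argmin(expected_log_loss(grid, p))]
log_odds = np.log(p / (1.0 - p))

print("exponential loss minimizer:", F_exp, " half log-odds:", 0.5 * log_odds)
print("log-loss minimizer:        ", F_log, " log-odds:     ", log_odds)

# Penalty assigned to a badly misclassified point, y F(x) = -5.
yF = -5.0
print("exp loss:", np.exp(-yF), " log-loss:", np.log1p(np.exp(-yF)))
```

Both minimizers are positive exactly when p > 1/2, so taking the sign of either one gives the same decision.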


Remarks 7.7. 


• For comparison reasons, in Fig. 7.14, the squared error loss is shown. The squared error loss depends
on the value $(y - F(x))$, which is the equivalent of the margin defined above, $yF(x)$; indeed, for $y = \pm 1$ we have $(y - F(x))^2 = (1 - yF(x))^2$. Observe that,
besides the relatively large influence that large values of error have, the error is also penalized for 
patterns whose label has been predicted correctly. This is one more justification that the LS method 
is not, in general, a good choice for classification. 

• Multiclass generalizations of the boosting scheme have been proposed in [13,15]. In [10], regularized
versions of the AdaBoost scheme have been derived in order to impose sparsity. Different
regularization schemes are considered, including $\ell_1$, $\ell_2$, and $\ell_\infty$. The end result is a family of
coordinate descent algorithms that integrate forward feature induction and back-pruning. In [33], a
version is presented where a priori knowledge is brought into the scene. The so-called AdaBoost*
is introduced in [29], where the margin is explicitly taken into account.

• Note that the boosting rationale can be applied equally well to regression tasks involving respective 
loss functions, such as the squared error one. A robust alternative to the squared error is the absolute 
error loss function [16]. 

• The boosting technique has attracted a lot of attention among researchers in the field in order to 
justify its good performance in practice and its relative immunity to overfitting. While the training 
error may become zero, this still does not necessarily imply overfitting. A first explanation was
attempted in terms of bounds, concerning the respective generalization performance. The derived 
bounds are independent of the number of iterations, K , and they are expressed in terms of the margin 
[32]. However, these bounds tend to be very loose. Another explanation may lie in the fact that each 
time, optimization is carried out with only a single set of parameters. The interested reader may find 
the discussions following the papers [4,6,15] very enlightening on this issue. 

Example 7.5. Consider a 20-dimensional two-class classification task. The data points from the first
class ($\omega_1$) stem from either of the two Gaussian distributions with means $\mu_{11} = [0, 0, \ldots, 0]^T$, $\mu_{12} =
[1, 1, \ldots, 1]^T$, while the points of the second class ($\omega_2$) stem from the Gaussian distribution with mean
$\mu_2 = [\underbrace{0, \ldots, 0}_{10}, \underbrace{1, \ldots, 1}_{10}]^T$. The covariance matrices of all distributions are equal to the 20-dimensional
identity matrix. Each one of the training and the test sets consists of 300 points, 200 from $\omega_1$ (100 from
each distribution) and 100 from $\omega_2$.

For the AdaBoost, the base classifier was selected to be a stump. This is a very naive type of tree,
consisting of a single node, and classification of a feature vector $x$ is achieved on the basis of the value
of only one of its features, say, $x_i$. Thus, if $x_i < a$, where $a$ is an appropriate threshold, $x$ is assigned
to class $\omega_1$. If $x_i > a$, it is assigned to class $\omega_2$. The decision about the choice of the specific feature,
$x_i$, to be used in the classifier was randomly made. Such a classifier results in a training error rate
slightly better than 0.5. The AdaBoost algorithm was run on the training data for 2000 iteration steps.
Fig. 7.15 verifies the fact that the training error rate converges to zero very fast. The test error rate
keeps decreasing even after the training error rate becomes zero and then levels off at around 0.15.
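A setup of this kind can be sketched in a few lines of Python. The code below is only an illustration under stated assumptions, not the implementation used for Fig. 7.15: the threshold and the polarity of each stump are chosen here by minimizing the weighted training error on the randomly selected feature (the text only says "an appropriate threshold"), the number of rounds is reduced to 100 to keep the run short, and all function names and the random seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

def make_data(rng, n_per_component=100, dim=20):
    """Draw one data set with the class structure described in Example 7.5."""
    mu11, mu12 = np.zeros(dim), np.ones(dim)
    mu2 = np.concatenate([np.zeros(dim // 2), np.ones(dim // 2)])
    X = np.vstack([rng.normal(size=(n_per_component, dim)) + m
                   for m in (mu11, mu12, mu2)])
    y = np.concatenate([np.ones(2 * n_per_component), -np.ones(n_per_component)])
    return X, y

def fit_stump(x, y, w):
    """Threshold a and polarity s on one feature minimizing the weighted error."""
    best = (np.inf, 0.0, 1.0)
    for a in np.unique(x):
        for s in (1.0, -1.0):
            pred = np.where(x < a, s, -s)
            err = np.sum(w[pred != y])
            if err < best[0]:
                best = (err, a, s)
    return best

def adaboost(X, y, n_rounds, rng):
    """AdaBoost with stumps; the feature of each stump is picked at random."""
    n, dim = X.shape
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(n_rounds):
        j = rng.integers(dim)
        err, a, s = fit_stump(X[:, j], y, w)
        err = np.clip(err, 1e-12, 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = np.where(X[:, j] < a, s, -s)
        w = w * np.exp(-alpha * y * pred)         # emphasize misclassified points
        w /= w.sum()
        stumps.append((j, a, s, alpha))
    return stumps

def decision_function(stumps, X):
    """Weighted sum of the stump outputs; its sign is the predicted label."""
    F = np.zeros(X.shape[0])
    for j, a, s, alpha in stumps:
        F += alpha * np.where(X[:, j] < a, s, -s)
    return F

X_train, y_train = make_data(rng)
X_test, y_test = make_data(rng)
stumps = adaboost(X_train, y_train, n_rounds=100, rng=rng)

F_train = decision_function(stumps, X_train)
train_error = np.mean(np.sign(F_train) != y_train)
test_error = np.mean(np.sign(decision_function(stumps, X_test)) != y_test)
exp_loss = np.mean(np.exp(-y_train * F_train))
print("train error:", train_error, " test error:", test_error)
print("empirical exponential loss (upper bound on train error):", exp_loss)
```

The empirical exponential loss printed at the end always upper bounds the training error rate, because the 0-1 loss never exceeds exp(-yF(x)).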


7.11 BOOSTING TREES 


In the discussion on experimental comparison of various methods in Section 7.9, it was stated that
boosted trees are among the most powerful learning schemes for classification and data mining. Thus,
it is worth spending some more time on this special type of boosting technique.

Trees were introduced in Section 7.8. From the knowledge we have now acquired, it is not difficult 
to see that the output of a tree can be compactly written as 


$$T(x; \Theta) = \sum_{j=1}^{J} y_j \chi_{R_j}(x), \tag{7.103}$$

where $J$ is the number of leaf nodes, $R_j$ is the region associated with the $j$th leaf after the space
partition imposed by the tree, $y_j$ is the respective label associated with $R_j$ (output/prediction value
for regression), and $\chi$ is our familiar characteristic function. The set of parameters, $\Theta$, consists of
$(y_j, R_j)$, $j = 1, 2, \ldots, J$, which are estimated during training. These can be obtained by selecting an
appropriate cost function. Also, suboptimal techniques are usually employed in order to build up a
tree, such as the ones discussed in Section 7.8.

FIGURE 7.15

Training and test error rate curves as a function of the number of iterations, for the case of Example 7.5.

In a boosted tree model, the base classifier comprises a tree. For example, the stump used in Example 7.5
is a very special case of a base tree classifier. In practice, one can employ trees of larger size.
Of course, the size must not be very large, in order to be closer to a weak classifier. Usually, values of
$J$ between three and eight are advisable.

The boosted tree model can be written as 


$$F(x) = \sum_{k=1}^{K} T(x; \Theta_k), \tag{7.104}$$

where

$$T(x; \Theta_k) = \sum_{j=1}^{J} y_{kj} \chi_{R_{kj}}(x).$$

Eq. (7.104) is basically the same as (7.84), with the $a_k$'s being equal to one. We have assumed the size
of all the trees to be the same, although this may not be necessarily the case. Adopting a loss function
$\mathcal{L}$ and the greedy rationale used for the more general boosting approach, we arrive at the following
recursive scheme of optimization:


$$\Theta_i = \arg\min_{\Theta} \sum_{n=1}^{N} \mathcal{L}\left(y_n, F_{i-1}(x_n) + T(x_n; \Theta)\right). \tag{7.105}$$

Optimization with respect to $\Theta$ takes place in two steps: one with respect to $y_{ij}$, $j = 1, 2, \ldots, J$, given
$R_{ij}$, and then one with respect to the regions $R_{ij}$. The latter is a difficult task and only simplifies in
very special cases. In practice, a number of approximations can be employed. Note that in the case of
the exponential loss and the two-class classification task, the above is directly linked to the AdaBoost
scheme.

For more general cases, numeric optimization schemes are mobilized (see [16]). The same rationale 
applies for regression trees, where now loss functions for regression, such as the squared error or the 
absolute error value, are used. Such schemes are also known as multiple additive regression trees 
(MARTs). A related implementation code for boosted trees is freely available in the R gbm package 
[30]. 

There are two critical factors concerning boosted trees. One is the size of the trees, $J$, and the other
is the choice of $K$. Concerning the size of the trees, usually one tries different sizes, $4 \le J \le 8$, and
selects the best one. Concerning the number of iterations, for large values, the training error may get
close to zero, but the test error can increase due to overfitting. Thus, one has to stop early enough,
usually by monitoring the performance.

Another way to cope with overfitting is to employ shrinkage methods, which tend to be equivalent
to regularization. For example, in the stage-wise expansion of $F_i(x)$ used in the optimization step
(7.105), one can instead adopt the following:

$$F_i(\cdot) = F_{i-1}(\cdot) + \nu T(\cdot; \Theta_i).$$

The parameter $\nu$ takes small values and it can be considered as controlling the learning rate of the
boosting procedure. Values $\nu < 0.1$ are advised. However, the smaller the value of $\nu$, the
larger the value of $K$ should be to guarantee good performance. For more on MARTs, the interested
reader can consult [19].
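The shrinkage recursion can be illustrated with a small regression sketch. It assumes the squared error loss, for which the greedy step (7.105) amounts to fitting the new tree to the current residuals, and it uses one-dimensional regression stumps (two leaves) as base trees; the data, the function names, and the values of the learning rate are illustrative assumptions and do not come from the text.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Least-squares stump on a 1-D input: split point and the two leaf values."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best = (np.inf, xs[0], rs.mean(), rs.mean())
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no valid split between identical inputs
        left, right = rs[:i].mean(), rs[i:].mean()
        sse = np.sum((rs[:i] - left) ** 2) + np.sum((rs[i:] - right) ** 2)
        if sse < best[0]:
            best = (sse, 0.5 * (xs[i] + xs[i - 1]), left, right)
    return best[1:]

def boost(x, y, n_trees, nu):
    """F_i = F_{i-1} + nu * T_i, with T_i fitted to the residuals y - F_{i-1}."""
    F = np.zeros_like(y)
    mse = []
    for _ in range(n_trees):
        split, left, right = fit_regression_stump(x, y - F)
        F = F + nu * np.where(x < split, left, right)
        mse.append(np.mean((y - F) ** 2))
    return np.array(mse)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)   # assumed toy regression task

mse_small = boost(x, y, n_trees=100, nu=0.1)  # strong shrinkage
mse_large = boost(x, y, n_trees=100, nu=1.0)  # no shrinkage
print("training MSE after 10 trees :", mse_small[9], mse_large[9])
print("training MSE after 100 trees:", mse_small[-1], mse_large[-1])
```

With the smaller learning rate, the training error decreases more slowly per tree, which is consistent with the remark that a smaller value of the shrinkage parameter calls for a larger number of trees. The sketch reports training error only; the effect on overfitting has to be judged on held-out data.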


PROBLEMS 


7.1 Show that the Bayesian classifier is optimal, in the sense that it minimizes the probability of 
error. 

Hint: Consider a classification task of $M$ classes and start with the probability of correct label
prediction, $P(C)$. Then the probability of error will be $P(e) = 1 - P(C)$.

7.2 Show that if the data follow the Gaussian distribution in an M-class task, with equal covariance 
matrices in all classes, the regions formed by the Bayesian classifier are convex. 

7.3 Derive the form of the Bayesian classifier for the case of two equiprobable classes, when the data 
follow the Gaussian distribution of the same covariance matrix. Furthermore, derive the equation 
that describes the LS linear classifier. Compare and comment on the results. 

7.4 Show that the ML estimate of the covariance matrix of a Gaussian distribution, based on $N$ i.i.d.
observations, $x_n$, $n = 1, 2, \ldots, N$, is given by

$$\hat{\Sigma}_{ML} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu}_{ML})(x_n - \hat{\mu}_{ML})^T,$$

where

$$\hat{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$

7.5 Prove that the covariance estimate

$$\hat{\Sigma} = \frac{1}{N-1}\sum_{k=1}^{N} (x_k - \hat{\mu})(x_k - \hat{\mu})^T$$

defines an unbiased estimator, where

$$\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N} x_k.$$

7.6 Show that the derivative of the logistic link function is given by

$$\frac{d\sigma(t)}{dt} = \sigma(t)\left(1 - \sigma(t)\right).$$

7.7 Derive the gradient of the negative log-likelihood function associated with the two-class logistic
regression.

7.8 Derive the Hessian matrix of the negative log-likelihood function associated with the two-class
logistic regression.

7.9 Show that the Hessian matrix of the negative log-likelihood function of the two-class logistic
regression is a positive definite matrix.

7.10 Show that if

$$\phi_m = \frac{\exp(t_m)}{\sum_{j=1}^{M} \exp(t_j)},$$

the derivative with respect to $t_j$, $j = 1, 2, \ldots, M$, is given by

$$\frac{\partial \phi_m}{\partial t_j} = \phi_m(\delta_{mj} - \phi_j).$$

7.11 Derive the gradient of the negative log-likelihood for the multiclass logistic regression case.

7.12 Derive the $j, k$ block element of the Hessian matrix of the negative log-likelihood function for
the multiclass logistic regression.

7.13 Consider the Rayleigh ratio,

$$R = \frac{\theta^T A \theta}{\|\theta\|^2},$$

where $A$ is a symmetric positive definite matrix. Show that $R$ is maximized with respect to $\theta$ if
$\theta$ is the eigenvector corresponding to the maximum eigenvalue of $A$.

7.14 Consider the generalized Rayleigh quotient,

$$R_g = \frac{\theta^T B \theta}{\theta^T A \theta},$$

where $A$ and $B$ are symmetric positive definite matrices. Show that $R_g$ is maximized with respect
to $\theta$ if $\theta$ is the eigenvector that corresponds to the maximum eigenvalue of $A^{-1}B$, assuming that
the inversion is possible.

7.15 Show that the between-class scatter matrix $\Sigma_b$ for an $M$-class problem is of rank $M - 1$.

7.16 Derive the arithmetic rule for combination, by minimizing the average Kullback-Leibler divergence.

7.17 Derive the product rule via the minimization of the Kullback-Leibler divergence, as pointed out
in the text.

7.18 Show that the error rate on the training set of the final classifier, obtained by boosting, tends to
zero exponentially fast.
MATLAB® EXERCISES

7.19 Consider a two-dimensional class problem that involves two classes, $\omega_1$ and $\omega_2$, which are modeled
by Gaussian distributions with means $\mu_1 = [0, 0]^T$ and $\mu_2 = [2, 2]^T$, respectively, and
common covariance matrix $\Sigma = \begin{bmatrix} 1 & 0.25 \\ 0.25 & 1 \end{bmatrix}$.

(i) Form and plot a data set $X$ consisting of 500 points from $\omega_1$ and another 500 points from
$\omega_2$.

(ii) Assign each one of the points of $X$ to either $\omega_1$ or $\omega_2$, according to the Bayes decision
rule, and plot the points with different colors, depending on the class they are assigned to.
Plot the corresponding classifier.

(iii) Based on (ii), estimate the error probability.

(iv) Let $L = \begin{bmatrix} 0 & 1 \\ 0.005 & 0 \end{bmatrix}$ be a loss matrix. Assign each one of the points of $X$ to either $\omega_1$ or
$\omega_2$, according to the average risk minimization rule (Eq. (7.13)), and plot the points with
different colors, depending on the class they are assigned to.

(v) Based on (iv), estimate the average risk for the above loss matrix.

(vi) Comment on the results obtained by (ii)-(iii) and (iv)-(v) scenarios.

7.20 Consider a two-dimensional class problem that involves two classes, $\omega_1$ and $\omega_2$, which are
modeled by Gaussian distributions with means $\mu_1 = [0, 2]^T$ and $\mu_2 = [0, 0]^T$ and covariance
matrices $\Sigma_1 = \begin{bmatrix} 4 & 1.8 \\ 1.8 & 1 \end{bmatrix}$ and $\Sigma_2 = \begin{bmatrix} 4 & 1.2 \\ 1.2 & 1 \end{bmatrix}$, respectively.

(i) Form and plot a data set $X$ consisting of 5000 points from $\omega_1$ and another 5000 points
from $\omega_2$.

(ii) Assign each one of the points of $X$ to either $\omega_1$ or $\omega_2$, according to the Bayes decision
rule, and plot the points with different colors, according to the class they are assigned to.

(iii) Compute the classification error probability.

(iv) Assign each one of the points of $X$ to either $\omega_1$ or $\omega_2$, according to the naive Bayes
decision rule, and plot the points with different colors, according to the class they are
assigned to.

(v) Compute the classification error probability for the naive Bayes classifier.

(vi) Repeat steps (i)-(v) for the case where $\Sigma_1 = \Sigma_2 = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$.

(vii) Comment on the results.

Hint: Use the fact that the marginal distributions of $P(\omega_1|x)$, $P(\omega_1|x_1)$, and $P(\omega_1|x_2)$ are also
Gaussians with means 0 and 2 and variances 4 and 1, respectively. Similarly, the marginal
distributions of $P(\omega_2|x)$, $P(\omega_2|x_1)$, and $P(\omega_2|x_2)$ are also Gaussians with means 0 and 0 and
variances 4 and 1, respectively.

7.21 Consider a two-class, two-dimensional classification problem, where the first class ($\omega_1$) is
modeled by a Gaussian distribution with mean $\mu_1 = [0, 2]^T$ and covariance matrix $\Sigma_1 =
\begin{bmatrix} 4 & 1.8 \\ 1.8 & 1 \end{bmatrix}$, while the second class ($\omega_2$) is modeled by a Gaussian distribution with mean
$\mu_2 = [0, 0]^T$ and covariance matrix $\Sigma_2 = \begin{bmatrix} 4 & 1.8 \\ 1.8 & 1 \end{bmatrix}$.

(i) Generate and plot a training set $X$ and a test set $X_{test}$, each one consisting of 1500 points
from each distribution.

(ii) Classify the data vectors of $X_{test}$ using the Bayesian classification rule.

(iii) Perform logistic regression and use the data set $X$ to estimate the involved parameter vector
$\theta$. Evaluate the classification error of the resulting classifier based on $X_{test}$.

(iv) Comment on the results obtained by (ii) and (iii).

(v) Repeat steps (i)-(iv) for the case where $\Sigma_2 = \begin{bmatrix} 4 & -1.8 \\ -1.8 & 1 \end{bmatrix}$ and compare the obtained
results with those produced by the previous setting. Draw your conclusions.

Hint: For the estimation of $\theta$ in (iii), perform gradient descent (Eq. (7.44)) and set the learning
parameter $\mu_i$ equal to 0.001.

7.22 Consider a two-dimensional classification problem involving three classes $\omega_1$, $\omega_2$, and $\omega_3$.
The data vectors from $\omega_1$ stem from either of the two Gaussian distributions with means $\mu_{11} = [0, 3]^T$
and $\mu_{12} = [11, -2]^T$ and covariance matrices $\Sigma_{11} = \begin{bmatrix} 0.2 & 0 \\ 0 & 2 \end{bmatrix}$ and $\Sigma_{12} = \begin{bmatrix} 3 & 0 \\ 0 & 0.5 \end{bmatrix}$, respectively.
Similarly, the data vectors from $\omega_2$ stem from either of the two Gaussian distributions
with means $\mu_{21} = [3, -2]^T$ and $\mu_{22} = [2.5, 4]^T$ and covariance matrices $\Sigma_{21} = \begin{bmatrix} 5 & 0 \\ 0 & 0.5 \end{bmatrix}$ and
$\Sigma_{22} = \begin{bmatrix} 7 & 0 \\ 0 & 0.5 \end{bmatrix}$, respectively. Finally, $\omega_3$ is modeled by a single Gaussian distribution with
mean $\mu_3 = [7, 2]^T$ and covariance matrix $\Sigma_3 = \begin{bmatrix} 8 & 0 \\ 0 & 0.5 \end{bmatrix}$.

(i) Generate and plot a training data set $X$ consisting of 1000 data points from $\omega_1$ (500 from
each distribution), 1000 data points from $\omega_2$ (again 500 from each distribution), and 500
points from $\omega_3$ (use 0 as the seed for the initialization of the Gaussian random number
generator). In a similar manner, generate a test data set $X_{test}$ (use 100 as the seed for the
initialization of the Gaussian random number generator).

(ii) Generate and view a decision tree based on using $X$ as the training set.

(iii) Compute the classification error on both the training and the test sets. Comment briefly on
the results.

(iv) Prune the produced tree at levels 0 (no actual pruning), 1, ..., 11 (in MATLAB®, trees are
pruned based on an optimal pruning scheme that first prunes branches giving less improvement
in error cost). For each pruned tree compute the classification error based on the test
set.

(v) Plot the classification error versus the pruned levels and locate the pruned level that gives
the minimum test classification error. What conclusions can be drawn by the inspection of
this plot?

(vi) View the original decision tree as well as the best-pruned one.

Hint: The MATLAB® functions that generate a decision tree (DT), display a DT, prune a DT,
and evaluate the performance of a DT on a given data set, are classregtree, view, prune, and
eval, respectively.

7.23 Consider a two-class, two-dimensional classification problem where the classes are modeled as
the first two classes in the previous exercise.

(i) Generate and plot a training set $X$, consisting of 100 data points from each distribution of
each class (that is, $X$ contains 400 points in total, 200 points from each class). In a similar
manner, generate a test set.

(ii) Use the training set to build a boosting classifier, utilizing as weak classifier a single-node
decision tree. Perform 12,000 iterations.

(iii) Plot the training and the test error versus the number of iterations and comment on the
results.

Hint:

- For (i) use randn('seed', 0) and randn('seed', 100) to initialize the random number generator
for the training and the test set, respectively.

- For (ii) use
ens = fitensemble(X', y, 'AdaBoostM1', no_of_base_classifiers, 'Tree'),
where X' has in its rows the data vectors, y is an ordinal vector containing the class
where each row vector of X' belongs, AdaBoostM1 is the boosting method used,
no_of_base_classifiers is the number of base classifiers that will be used, and Tree
denotes the weak classifier.

- For (iii) use L = loss(ens, X', y, 'mode', 'cumulative'), which for a given boosting classifier
ens, returns the vector L of errors performed on X', such that L(i) is the error committed
when only the first i weak classifiers are taken into account.
8.1 INTRODUCTION

The theory of convex sets and functions has a rich history and has been the focus of intense study for 
over a century in mathematics. In the terrain of applied sciences and engineering, the revival of interest
in convex functions and optimization is traced back to the early 1980s. In addition to the increased
processing power that became available via the use of computers, certain theoretical developments
were catalytic in demonstrating the power of such techniques. The advent of the so-called interior 
point methods opened a new path in solving the classical linear programming task. Moreover, it was 
increasingly realized that, despite its advantages, the least-squares (LS) method also has a number 
of drawbacks, particularly in the presence of non-Gaussian noise and the presence of outliers. It has 
been demonstrated that the use of alternative cost functions, which may not even be differentiable, can 
alleviate a number of problems that are associated with the LS methods. Furthermore, the increased 
interest in robust learning methods brought into the scene the need for nontrivial constraints, which the 
optimized solution has to respect. In the machine learning community, the discovery of support vector 
machines, to be treated in Chapter 11, played an important role in popularizing convex optimization 
techniques. 

The goal of this chapter is to present some basic notions and definitions related to convex analysis 
and optimization in the context of machine learning and signal processing. Convex optimization is a 
discipline in itself, and it cannot be summarized in one chapter. The emphasis here is on computationally
light techniques with a focus on online versions, which are gaining in importance in the context of
big data applications. A related discussion is also part of this chapter.

The material revolves around two families of algorithms. One goes back to the classical work of 
Von Neumann on projections on convex sets, which is reviewed together with its more recent online
versions. The notions of projection and related properties are treated in some detail. The method of 
projections, in the context of constrained optimization, has been gaining in popularity recently. 

The other family of algorithms that is considered builds around the notion of subgradient for opti- 
mizing nondifferentiable convex functions and generalizations of the gradient descent family, discussed 
in Chapter 5. Furthermore, a powerful tool for analyzing the performance of online algorithms for convex
optimization, known as regret analysis, is discussed and a related case study is presented. Finally,
some current trends in convex optimization, involving proximal and mirror descent methods, are introduced.


8.2 CONVEX SETS AND FUNCTIONS 

Although most of the algorithms we will discuss in this chapter refer to vector variables in Euclidean
spaces, which is in line with what we have done so far in this book, the definitions and some of the
fundamental theorems will be stated in the context of the more general case of Hilbert spaces.¹ This is
because the current chapter will also serve the needs of subsequent chapters, whose setting is that of
infinite-dimensional Hilbert spaces. For those readers who are not familiar with such spaces, all they
need to know is that a Hilbert space is a generalization of the Euclidean one, allowing for infinite
dimensions. To serve the needs of these readers, the differences between Euclidean and the more general
Hilbert spaces in the theorems will be carefully pointed out whenever this is required.

8.2.1 CONVEX SETS

Definition 8.1. A nonempty subset $C$ of a Hilbert space $\mathbb{H}$, $C \subseteq \mathbb{H}$, is called convex if $\forall\, x_1, x_2 \in C$
and $\forall \lambda \in [0, 1]$ the following holds true:

$$x := \lambda x_1 + (1 - \lambda)x_2 \in C. \tag{8.1}$$

Note that if $\lambda = 1$, $x = x_1$, and if $\lambda = 0$, $x = x_2$. For any other value of $\lambda$ in $[0, 1]$, $x$ lies in the
line segment joining $x_1$ and $x_2$. Indeed, from (8.1) we can write

$$x - x_2 = \lambda(x_1 - x_2), \quad 0 \le \lambda \le 1.$$


Fig. 8.1 shows two examples of convex sets, in the two-dimensional Euclidean space, $\mathbb{R}^2$. In Fig. 8.1A,
the set comprises all points whose Euclidean ($\ell_2$) norm is less than or equal to one,

$$C_2 = \left\{x : \sqrt{x_1^2 + x_2^2} \le 1\right\}.$$

Sometimes we refer to $C_2$ as the $\ell_2$ ball of radius equal to one. Note that the set includes all the points
on and inside the circle. The set in Fig. 8.1B comprises all the points on and inside the rhombus defined
by

$$C_1 = \left\{x : |x_1| + |x_2| \le 1\right\}.$$

Because the sum of the absolute values of the components of a vector defines the $\ell_1$ norm, that is,
$\|x\|_1 := |x_1| + |x_2|$, in analogy to $C_2$ we call the set $C_1$ the $\ell_1$ ball of radius equal to one. In contrast,
the sets of points whose $\ell_2$ and $\ell_1$ norms are equal to one, or in other words,

$$C_2 = \left\{x : x_1^2 + x_2^2 = 1\right\}, \quad C_1 = \left\{x : |x_1| + |x_2| = 1\right\},$$

are not convex (Problem 8.2). Fig. 8.2 shows two examples of nonconvex sets.
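Definition 8.1 lends itself to a quick numerical sanity check. The sketch below samples pairs of points in the two unit balls and confirms that their convex combinations stay inside, and then exhibits two points on the unit circle (and on the rhombus boundary) whose midpoint leaves the set. The helper names, the sampling scheme, and the numerical tolerance are illustrative assumptions; a random check of this kind is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
TOL = 1e-12  # tolerance for floating-point round-off

def random_point_in_unit_ball(norm_ord):
    """Sample in [-1, 1]^2 and shrink onto the ball if the point falls outside."""
    x = rng.uniform(-1.0, 1.0, size=2)
    return x / max(1.0, np.linalg.norm(x, norm_ord))

def combos_stay_inside(norm_ord, n_trials=10000):
    """Check (8.1) for random x1, x2 in the unit ball and random lambda in [0, 1]."""
    for _ in range(n_trials):
        x1 = random_point_in_unit_ball(norm_ord)
        x2 = random_point_in_unit_ball(norm_ord)
        lam = rng.uniform()
        x = lam * x1 + (1.0 - lam) * x2
        if np.linalg.norm(x, norm_ord) > 1.0 + TOL:
            return False
    return True

l2_ball_ok = combos_stay_inside(2)
l1_ball_ok = combos_stay_inside(1)

# Points with norm exactly one: the midpoint of two antipodal points is the origin.
x1, x2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
mid = 0.5 * x1 + 0.5 * x2
mid_on_circle = np.isclose(np.linalg.norm(mid, 2), 1.0)
mid_on_rhombus = np.isclose(np.linalg.norm(mid, 1), 1.0)

print("l2 ball closed under convex combinations (sampled):", l2_ball_ok)
print("l1 ball closed under convex combinations (sampled):", l1_ball_ok)
print("midpoint stays on the circle / rhombus boundary:", mid_on_circle, mid_on_rhombus)
```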


1 The mathematical definition of a Hilbert space is provided in the appendix associated with this chapter and which can be 
downloaded from the book’s website. 

2 In conformity with Euclidean vector spaces and for the sake of notational simplicity, we will keep the same notation and 
denote the elements of a Hilbert space with lower case bold letters. 
FIGURE 8.1

(A) The $\ell_2$ ball of radius $\delta = 1$ comprises all points with Euclidean norm less than or equal to $\delta = 1$. (B) The $\ell_1$ ball
consists of all the points with $\ell_1$ norm less than or equal to $\delta = 1$. Both are convex sets.

FIGURE 8.2

Examples of two nonconvex sets. In both cases, the point $x$ does not lie on the same set in which $x_1$ and $x_2$ belong.
In (A) the set comprises all the points whose Euclidean norm is equal to one.


8.2.2 CONVEX FUNCTIONS 

Definition 8.2. A function

$$f : \mathcal{X} \subseteq \mathbb{R}^l \longmapsto \mathbb{R}$$

is called convex if $\mathcal{X}$ is convex and if $\forall\, x_1, x_2 \in \mathcal{X}$ the following holds true:

$$f(\lambda x_1 + (1 - \lambda)x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2), \quad \lambda \in [0, 1]. \tag{8.2}$$

The function is called strictly convex if (8.2) holds true with strict inequality when $\lambda \in (0, 1)$, $x_1 \ne
x_2$. The geometric interpretation of (8.2) is that the line segment joining the points $(x_1, f(x_1))$ and
$(x_2, f(x_2))$ lies above the graph of $f(x)$, as shown in Fig. 8.3. We say that a function is concave
(strictly concave) if the negative, $-f$, is convex (strictly convex). Next, we state three important theorems.

FIGURE 8.3

The line segment joining the points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the graph of $f(x)$. The shaded region
corresponds to the epigraph of the function.

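Theorem 8.1 (First-order convexity condition). Let $\mathcal{X} \subseteq \mathbb{R}^l$ be a convex set and let

$$f : \mathcal{X} \longmapsto \mathbb{R}$$

be a differentiable function. Then $f$ is convex if and only if $\forall\, x, y \in \mathcal{X}$,

$$f(y) \ge f(x) + \nabla^T f(x)(y - x). \tag{8.3}$$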

The proof of the theorem is given in Problem 8.3. The theorem generalizes to nondifferentiable convex 
functions; it will be discussed in this context in Section 8.10. 

Fig. 8.4 illustrates the geometric interpretation of this theorem. It means that the graph of the convex
function is located above the graph of the affine function

$$g : y \longmapsto \nabla^T f(x)(y - x) + f(x),$$

which defines the tangent hyperplane of the graph at the point $(x, f(x))$.


Theorem 8.2 (Second-order convexity condition). Let $\mathcal{X} \subseteq \mathbb{R}^l$ be a convex set. Then a twice differentiable
function $f : \mathcal{X} \longmapsto \mathbb{R}$ is convex (strictly convex) if and only if the Hessian matrix is positive
semidefinite (positive definite).


The proof of the theorem is given in Problem 8.5. Recall that in previous chapters, when we dealt
with the squared error loss function, we commented that it is a convex one. Now we are ready to justify
this argument. Consider the quadratic function,

$$f(x) = \frac{1}{2} x^T Q x + b^T x + c,$$

where $Q$ is a positive definite matrix. Taking the gradient, we have

$$\nabla f(x) = Qx + b,$$

and the Hessian matrix is equal to $Q$, which by assumption is positive definite, hence $f$ is a (strictly)
convex function.

FIGURE 8.4

The graph of a convex function is above the tangent plane at any point of the respective graph.
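For the quadratic function, both convexity conditions can be verified numerically. The following sketch builds a symmetric positive definite matrix Q (the particular construction, the dimension, and the random seed are assumptions made for the illustration), checks that all eigenvalues of the Hessian are positive, and evaluates the gap f(y) - f(x) - ∇ᵀf(x)(y - x) of the first-order condition (8.3) at random pairs of points. For this function the gap equals ½(y - x)ᵀQ(y - x), so it should never be negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example: a symmetric positive definite Q built as A A^T + 0.1 I.
A = rng.normal(size=(3, 3))
Q = A @ A.T + 0.1 * np.eye(3)
b = rng.normal(size=3)
c = 1.0

def f(x):
    """Quadratic function 0.5 x^T Q x + b^T x + c."""
    return 0.5 * x @ Q @ x + b @ x + c

def grad_f(x):
    """Gradient Q x + b (valid because Q is symmetric)."""
    return Q @ x + b

# Second-order condition: the Hessian is Q; check its eigenvalues.
eigvals = np.linalg.eigvalsh(Q)

# First-order condition (8.3): f(y) >= f(x) + grad_f(x)^T (y - x).
gaps = []
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    gaps.append(f(y) - f(x) - grad_f(x) @ (y - x))
gaps = np.array(gaps)

print("smallest Hessian eigenvalue:", eigvals.min())
print("smallest first-order gap   :", gaps.min())
```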

In the sequel, two very important notions in convex analysis and optimization are defined.

Definition 8.3. The epigraph of a function $f$ is defined as the set of points

$$\operatorname{epi}(f) := \left\{(x, r) \in \mathcal{X} \times \mathbb{R} : f(x) \le r\right\} : \text{ epigraph}. \tag{8.4}$$

From a geometric point of view, the epigraph is the set of all points in $\mathbb{R}^l \times \mathbb{R}$ that lie on and above
the graph of $f(x)$, as indicated by the gray shaded region in Fig. 8.3. It is important to note that a
function is convex if and only if its epigraph is a convex set (Problem 8.6).


Definition 8.4. Given a real number $\xi$, the lower level set of function $f : \mathcal{X} \subseteq \mathbb{R}^l \longmapsto \mathbb{R}$, at height $\xi$,
is defined as

$$\operatorname{lev}_{\le \xi}(f) := \left\{x \in \mathcal{X} : f(x) \le \xi\right\} : \text{ level set at } \xi. \tag{8.5}$$

In words, it is the set of all points at which the function takes a value less than or equal to $\xi$. The
geometric interpretation of the level set is shown in Fig. 8.5. It can easily be shown (Problem 8.7) that
if a function $f$ is convex, then its lower level set is convex for any $\xi \in \mathbb{R}$. The converse is not true.
We can easily check that the function $f(x) = -\exp(x)$ is not convex (as a matter of fact it is concave)
and all its lower level sets are convex.
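The counterexample f(x) = -exp(x) can be examined numerically. The sketch below first shows one pair of points for which the convexity inequality (8.2) is violated, and then checks, on a finite grid, that the lower level sets are intervals (for ξ < 0 the level set is the half-line x ≥ ln(-ξ); for ξ ≥ 0 it is the whole real axis). The grid, the test heights, and the helper name are illustrative assumptions.

```python
import numpy as np

def f(x):
    return -np.exp(x)

# Convexity would require f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2).
x1, x2, lam = 0.0, 2.0, 0.5
lhs = f(lam * x1 + (1.0 - lam) * x2)        # f(1) = -e
rhs = lam * f(x1) + (1.0 - lam) * f(x2)     # -(1 + e^2) / 2
violates_convexity = lhs > rhs

grid = np.linspace(-5.0, 5.0, 10001)

def level_set_is_interval(xi):
    """True if {x on the grid : f(x) <= xi} is empty or a run of consecutive grid points."""
    idx = np.flatnonzero(f(grid) <= xi)
    return idx.size == 0 or bool(np.all(np.diff(idx) == 1))

heights = [-1000.0, -10.0, -1.0, -0.1, 0.0, 1.0]
all_intervals = all(level_set_is_interval(xi) for xi in heights)

print("convexity inequality violated:", violates_convexity)
print("all sampled lower level sets are intervals:", all_intervals)
```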
FIGURE 8.5

The level set at height £ comprises all the points in the interval denoted as the “red” segment on the x-axis. 


Theorem 8.3 (Local and global minimizers). Let $f : \mathcal{X} \longmapsto \mathbb{R}$ be a convex function. Then if a point $x_*$ is
a local minimizer, it is also a global one, and the set of all minimizers is convex. Further, if the function
is strictly convex, the minimizer is unique.

Proof. Inasmuch as the function is convex we know that, $\forall\, x \in \mathcal{X}$,

$$f(x) \ge f(x_*) + \nabla^T f(x_*)(x - x_*),$$

and because at the minimizer the gradient is zero, we get

$$f(x) \ge f(x_*), \tag{8.6}$$

which proves the claim. Let us now denote

$$f_* = \min_{x} f(x). \tag{8.7}$$

Note that the set of all minimizers coincides with the level set at height $f_*$. Then, because the function
is convex, we know that the level set $\operatorname{lev}_{\le f_*}(f)$ is convex, which verifies the convexity of the set of
minimizers. Finally, for strictly convex functions, the inequality in (8.6) is a strict one, which proves the
uniqueness of the (global) minimizer. The theorem is also true, even if the function is not differentiable
(Problem 8.10). □


8.3 PROJECTIONS ONTO CONVEX SETS 

The projection onto a hyperplane in finite-dimensional Euclidean spaces was discussed and used in the
context of the affine projection algorithm (APA) in Chapter 5. The notion of projection will now be
generalized to include any closed convex set and also in the framework of general (infinite-dimensional)
Hilbert spaces.

FIGURE 8.6

The projection $x_*$ of $x$ onto the plane minimizes the distance of $x$ from all the points lying on the plane.

The concept of projection is among the most fundamental concepts in mathematics, and everyone
who has attended classes in basic geometry has used and studied it. What one may not have realized
is that while performing a projection, for example, drawing a line segment from a point to a line or a
plane, one is basically solving an optimization task. The point $x_*$ in Fig. 8.6, that is, the projection
of $x$ onto the plane $H$ in the three-dimensional space, is that point, among all the points lying on the
plane, whose (Euclidean) distance from $x = [x_1, x_2, x_3]^T$ is minimum; in other words,

$$x_* = \arg\min_{y \in H} \left((x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2\right). \tag{8.8}$$

As a matter of fact, what we have learned to do in our early days at school is to solve a constrained
optimization task. Indeed, (8.8) can equivalently be written as

$$x_* = \arg\min_{y} \|x - y\|^2, \quad \text{s.t.} \quad \theta^T y + \theta_0 = 0,$$

where the constraint is the equation describing the specific plane. Our goal herein focuses on
generalizing the notion of projection, to employ it to attack more general and complex tasks.
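The constrained task above can be illustrated numerically. The sketch assumes the standard closed-form expression for the projection onto a hyperplane in a Euclidean space, x - ((θᵀx + θ₀)/‖θ‖²)θ, which has the same form as the expression given later in Example 8.1. The particular plane, the point x, and the function name are hypothetical choices made for the illustration. The code checks that the projected point satisfies the constraint and that no other sampled point of the plane is closer to x.

```python
import numpy as np

def project_onto_hyperplane(x, theta, theta0):
    """Euclidean projection of x onto {y : theta^T y + theta0 = 0}."""
    return x - (theta @ x + theta0) / (theta @ theta) * theta

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])   # assumed normal vector of the plane
theta0 = 3.0                          # assumed offset
x = np.array([2.0, 1.0, -1.0])        # point to be projected

x_star = project_onto_hyperplane(x, theta, theta0)
constraint_residual = theta @ x_star + theta0

# Other points of the plane, obtained by projecting random points onto it.
others = np.array([
    project_onto_hyperplane(rng.normal(scale=5.0, size=3), theta, theta0)
    for _ in range(1000)
])
d_star = np.linalg.norm(x - x_star)
d_others = np.linalg.norm(others - x, axis=1)

print("projection:", x_star)
print("constraint residual:", constraint_residual)
print("distance to projection:", d_star, " smallest sampled distance:", d_others.min())
```

The residual x - x_star is parallel to θ, that is, orthogonal to the plane, which is the geometric picture learned in basic geometry.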

Theorem 8.4. Let $C$ be a nonempty closed³ convex set in a Hilbert space $\mathbb{H}$ and $x \in \mathbb{H}$. Then there
exists a unique point, denoted as $P_C(x) \in C$, such that

$$\|x - P_C(x)\| = \min_{y \in C} \|x - y\| : \text{ projection of } x \text{ on } C.$$

The term $P_C(x)$ is called the (metric) projection of $x$ onto $C$. Note that if $x \in C$, then $P_C(x) = x$, since
this makes the norm $\|x - P_C(x)\| = 0$.

3 For the needs of this chapter, it suffices to say that a set C is closed if the limit point of any sequence of points in C lies in C.

Proof The proof comprises two paths. One is to establish uniqueness and the other to establish exis- 
tence. Uniqueness is easier, and the proof will be given here. Existence is slightly more technical, and 
it is provided in Problem 8.11. 

To show uniqueness, assume that there are two points, $x_{*,1}$ and $x_{*,2}$, $x_{*,1} \neq x_{*,2}$, such that

$$\|x - x_{*,1}\| = \|x - x_{*,2}\| = \min_{y \in C} \|x - y\|. \tag{8.9}$$

(a) If $x \in C$, then $P_C(x) = x$ is unique, since any other point in $C$ would make $\|x - P_C(x)\| > 0$.

(b) Let $x \notin C$. Then, mobilizing the parallelogram law of the norm (Eq. (8.151) in the chapter's
appendix, and Problem 8.8) we get

$$\|(x - x_{*,1}) + (x - x_{*,2})\|^2 + \|(x - x_{*,1}) - (x - x_{*,2})\|^2 = 2\left(\|x - x_{*,1}\|^2 + \|x - x_{*,2}\|^2\right),$$

or

$$\|2x - (x_{*,1} + x_{*,2})\|^2 + \|x_{*,1} - x_{*,2}\|^2 = 2\left(\|x - x_{*,1}\|^2 + \|x - x_{*,2}\|^2\right),$$

and exploiting (8.9) and the fact that $\|x_{*,1} - x_{*,2}\| > 0$, we have

$$\left\|x - \frac{1}{2}\left(x_{*,1} + x_{*,2}\right)\right\|^2 < \|x - x_{*,1}\|^2. \tag{8.10}$$

However, due to the convexity of $C$, the point $\frac{1}{2}x_{*,1} + \frac{1}{2}x_{*,2}$ lies in $C$. Also, by the definition of
projection, $x_{*,1}$ is the point with the smallest distance, and hence (8.10) cannot be valid. □

For the existence, one has to use the property of closedness (every convergent sequence in $C$ has its limit in $C$) as
well as the property of completeness of Hilbert spaces, which guarantees that every Cauchy sequence
in $\mathbb{H}$ has a limit (see chapter's appendix). The proof is given in Problem 8.11.

Remarks 8.1. 



• Note that if $x \notin C \subset \mathbb{H}$, then its projection onto $C$ lies on the boundary of $C$ (Problem 8.12).


Example 8.1. Derive analytical expressions for the projections of a point $x \in \mathbb{H}$, where $\mathbb{H}$ is a real
Hilbert space, onto (a) a hyperplane, (b) a halfspace, and (c) the $\ell_2$ ball of radius $\delta$.


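(a) A hyperplane $H$ is defined as

$$H := \{y : \langle\theta, y\rangle + \theta_0 = 0\},$$

for some $\theta \in \mathbb{H}$, $\theta_0 \in \mathbb{R}$, where $\langle\cdot,\cdot\rangle$ denotes the respective inner product. If $\mathbb{H}$ breaks down to a
Euclidean space, the projection is readily available by simple geometric arguments, and it is given by

$$P_C(x) = x - \frac{\langle\theta, x\rangle + \theta_0}{\|\theta\|^2}\,\theta : \quad \text{projection onto a hyperplane}, \tag{8.11}$$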
FIGURE 8.7

The projection onto a hyperplane in $\mathbb{R}^3$. The vector $\theta$ should be at the origin of the axes, but it is drawn to clearly
show that it is perpendicular to $H$.



FIGURE 8.8 

Projection onto a halfspace. 


and it is shown in Fig. 8.7. For a general Hilbert space $\mathbb{H}$, the hyperplane is a closed convex subset
of $\mathbb{H}$, and the projection is still given by the same formula (Problem 8.13).

(b) The definition of a halfspace, $H^+$, is given by

$$H^+ = \{y : \langle\theta, y\rangle + \theta_0 \geq 0\}, \tag{8.12}$$

and it is shown in Fig. 8.8 for the $\mathbb{R}^3$ case.

Because the projection lies on the boundary if $x \notin H^+$, its projection will lie on the hyperplane
defined by $\theta$ and $\theta_0$, and it will be equal to $x$ if $x \in H^+$; thus, the projection is easily shown to be

$$P_{H^+}(x) = x - \frac{\min\{0, \langle\theta, x\rangle + \theta_0\}}{\|\theta\|^2}\,\theta : \quad \text{projection onto a halfspace}. \tag{8.13}$$
(c) The closed ball centered at $0$ and of radius $\delta$, denoted as $B[0, \delta]$, in a general Hilbert space, $\mathbb{H}$, is
defined as

$$B[0, \delta] = \{y : \|y\| \leq \delta\}.$$

The projection of a point $x$ onto $B[0, \delta]$ is given by

$$P_{B[0,\delta]}(x) = \begin{cases} x, & \text{if } \|x\| \leq \delta, \\ \delta\,\dfrac{x}{\|x\|}, & \text{if } \|x\| > \delta, \end{cases} \quad : \text{projection onto a closed ball}, \tag{8.14}$$

and it is geometrically illustrated in Fig. 8.9, for the case of $\mathbb{R}^2$ (Problem 8.14).
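The three closed-form projections of Example 8.1 can be sketched numerically as follows. This is a minimal illustration for the Euclidean case, assuming NumPy; the helper names `project_hyperplane`, `project_halfspace`, and `project_ball`, as well as the test points, are hypothetical and chosen only for this example.

```python
import numpy as np


def project_hyperplane(x, theta, theta0):
    """Projection onto {y : <theta, y> + theta0 = 0}, Eq. (8.11)."""
    return x - (theta @ x + theta0) / (theta @ theta) * theta


def project_halfspace(x, theta, theta0):
    """Projection onto {y : <theta, y> + theta0 >= 0}, Eq. (8.13)."""
    return x - min(0.0, theta @ x + theta0) / (theta @ theta) * theta


def project_ball(x, delta):
    """Projection onto the closed ball B[0, delta], Eq. (8.14)."""
    norm = np.linalg.norm(x)
    return x.copy() if norm <= delta else (delta / norm) * x


# The plane x3 = 1 in R^3, written as <theta, y> + theta0 = 0.
theta = np.array([0.0, 0.0, 1.0])
theta0 = -1.0

on_plane = project_hyperplane(np.array([3.0, 4.0, 2.0]), theta, theta0)
outside_half = project_halfspace(np.array([3.0, 4.0, -2.0]), theta, theta0)
inside_half = project_halfspace(np.array([3.0, 4.0, 5.0]), theta, theta0)
on_ball = project_ball(np.array([3.0, 4.0, 0.0]), delta=1.0)
```

A point that already belongs to the halfspace or to the ball is returned unchanged, in line with $P_C(x) = x$ for $x \in C$; a point outside is moved to the boundary.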



FIGURE 8.9 

The projection onto a closed ball of radius $\delta$ centered at the origin of the axes.


Remarks 8.2. 

• In the context of sparsity-aware learning, which is dealt with in Chapter 10, a key point is the
projection of a point in $\mathbb{R}^l$ ($\mathbb{C}^l$) onto the $\ell_1$ ball. There it is shown that, given the size of the ball,
this projection corresponds to the so-called soft thresholding operation (see, also, Example 8.10 for
a definition).

It should be stressed that a linear space equipped with the $\ell_1$ norm is no longer Euclidean (Hilbert),
inasmuch as this norm is not induced by an inner product operation; moreover, uniqueness of the
projection with respect to this norm is not guaranteed (Problem 8.15).

8.3.1 PROPERTIES OF PROJECTIONS

In this section, we summarize some basic properties of the projections. These properties are used to 
prove a number of theorems and convergence results, associated with algorithms that are developed 
around the notion of projection. 

Readers who are interested only in the algorithms can bypass this section. 
Proposition 8.1. Let $\mathbb{H}$ be a Hilbert space, $C \subseteq \mathbb{H}$ be a closed convex set, and $x \in \mathbb{H}$. Then the
projection $P_C(x)$ satisfies the following two properties⁴:

$$\mathrm{Real}\{\langle x - P_C(x), y - P_C(x)\rangle\} \leq 0, \quad \forall y \in C, \tag{8.15}$$

and

$$\|P_C(x) - P_C(y)\|^2 \leq \mathrm{Real}\{\langle x - y, P_C(x) - P_C(y)\rangle\}, \quad \forall x, y \in \mathbb{H}. \tag{8.16}$$


The proof of the proposition is provided in Problem 8.16. The geometric interpretation of (8.15) for
the case of a real Hilbert space is shown in Fig. 8.10. Note that for a real Hilbert space, the first property
becomes

$$\langle x - P_C(x), y - P_C(x)\rangle \leq 0, \quad \forall y \in C. \tag{8.17}$$

From the geometric point of view, (8.17) means that the angle formed by the two vectors $x - P_C(x)$
and $y - P_C(x)$ is obtuse. The hyperplane that crosses $P_C(x)$ and is orthogonal to $x - P_C(x)$ is known
as a supporting hyperplane, and it leaves all points in $C$ on one side and $x$ on the other. It can be shown
that if $C$ is closed and convex and $x \notin C$, there is always such a hyperplane (see, for example, [30]).



FIGURE 8.10 

The vectors $y - P_C(x)$ and $x - P_C(x)$ form an obtuse angle, $\phi$.


Lemma 8.1. Let $S \subseteq \mathbb{H}$ be a closed subspace in a Hilbert space $\mathbb{H}$. Then $\forall x, y \in \mathbb{H}$, the following
properties hold true:

$$\langle x, P_S(y)\rangle = \langle P_S(x), y\rangle = \langle P_S(x), P_S(y)\rangle \tag{8.18}$$


⁴ The theorems are stated here for the general case of complex numbers.
and

$$P_S(ax + by) = aP_S(x) + bP_S(y), \tag{8.19}$$

where $a$ and $b$ are arbitrary scalars. In other words, the projection operation on a closed subspace is
a linear one (Problem 8.17). Recall that in a Euclidean space all subspaces are closed, and hence the
linearity is always valid.



FIGURE 8.11 

Every point in a Hilbert space $\mathbb{H}$ can be uniquely decomposed into the sum of its projections on any closed subspace
$S$ and its orthogonal complement $S^\perp$.

It can be shown (Problem 8.18) that if $S$ is a closed subspace in a Hilbert space $\mathbb{H}$, its orthogonal
complement, $S^\perp$, is also a closed subspace, such that $S \cap S^\perp = \{0\}$; by definition, the orthogonal
complement $S^\perp$ is the set whose elements are orthogonal to each element of $S$. Moreover, $\mathbb{H} = S \oplus S^\perp$; that
is, each element $x \in \mathbb{H}$ can be uniquely decomposed as

$$x = P_S(x) + P_{S^\perp}(x), \quad x \in \mathbb{H} : \quad \text{for closed subspaces}, \tag{8.20}$$

as demonstrated in Fig. 8.11.
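For a finite-dimensional (Euclidean) space, the properties (8.18)-(8.20) can be checked numerically. The sketch below assumes NumPy and builds the projection matrix of a randomly chosen two-dimensional subspace of $\mathbb{R}^5$ from an orthonormal basis obtained by a QR factorization; the matrix names `P_S` and `P_perp` and the random test vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Columns of A span a 2-dimensional subspace S of R^5.
A = rng.normal(size=(5, 2))
Q, _ = np.linalg.qr(A)          # orthonormal basis of S (reduced QR)
P_S = Q @ Q.T                   # projection matrix onto S
P_perp = np.eye(5) - P_S        # projection matrix onto the orthogonal complement

x, y = rng.normal(size=5), rng.normal(size=5)
a, b = 2.0, -0.7

# Eq. (8.18): <x, P_S y> = <P_S x, y> = <P_S x, P_S y>
inner_1 = x @ (P_S @ y)
inner_2 = (P_S @ x) @ y
inner_3 = (P_S @ x) @ (P_S @ y)

# Eq. (8.19): linearity of the projection onto a subspace
linear_gap = np.linalg.norm(P_S @ (a * x + b * y) - (a * P_S @ x + b * P_S @ y))

# Eq. (8.20): x = P_S(x) + P_{S_perp}(x), with orthogonal components
decomposition_gap = np.linalg.norm(x - (P_S @ x + P_perp @ x))
cross_term = (P_S @ x) @ (P_perp @ x)
```

Up to rounding errors, the three inner products coincide, and both gaps as well as the cross term vanish.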

Definition 8.5. Let a mapping

$$T : \mathbb{H} \longmapsto \mathbb{H}.$$

$T$ is called nonexpansive if $\forall x, y \in \mathbb{H}$

$$\|T(x) - T(y)\| \leq \|x - y\| : \quad \text{nonexpansive mapping}. \tag{8.21}$$


Proposition 8.2. Let C be a closed convex set in a Hilbert space H. Then the associated projection 
operator 

$$P_C : \mathbb{H} \longmapsto C$$


is nonexpansive. 



FIGURE 8.12 

The nonexpansiveness property of the projection operator, $P_C(\cdot)$, guarantees that the distance between two points
can never be smaller than the distance between their respective projections on a closed convex set.


Proof. Let $x, y \in \mathbb{H}$. Recall property (8.16), that is,

$$\|P_C(x) - P_C(y)\|^2 \leq \mathrm{Real}\{\langle x - y, P_C(x) - P_C(y)\rangle\}. \tag{8.22}$$

Moreover, by employing the Schwarz inequality (Eq. (8.149) of chapter's appendix), we get

$$|\langle x - y, P_C(x) - P_C(y)\rangle| \leq \|x - y\|\,\|P_C(x) - P_C(y)\|. \tag{8.23}$$

Combining (8.22) and (8.23), we readily obtain

$$\|P_C(x) - P_C(y)\| \leq \|x - y\|. \tag{8.24}$$

□

Fig. 8.12 provides a geometric interpretation of (8.24). The property of nonexpansiveness, as well as
a number of its variants (for example, [6,7,30,81,82]), is of paramount importance in convex set theory
and learning. It is the property that guarantees the convergence of an algorithm, which comprises a
sequence of successive projections (mappings), to the so-called fixed point set, that is, to the set whose
elements are left unaffected by the respective mapping $T$, i.e.,

$$\mathrm{Fix}(T) = \{x \in \mathbb{H} : T(x) = x\} : \quad \text{fixed point set}.$$

In the case of a projection operator on a closed convex set, we know that the respective fixed point set
is the set $C$ itself, since $P_C(x) = x$, $\forall x \in C$.
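The obtuse-angle property (8.17) and the nonexpansiveness property (8.24) lend themselves to a quick numerical sanity check. The sketch below assumes NumPy and uses the unit ball in $\mathbb{R}^3$ as the closed convex set $C$; the helper `project_ball`, the sampling scheme, and the variable names are assumptions made for this illustration only, and a random check of this kind is of course no substitute for the proof.

```python
import numpy as np


def project_ball(x, delta=1.0):
    """Projection onto the closed ball B[0, delta], Eq. (8.14)."""
    norm = np.linalg.norm(x)
    return x.copy() if norm <= delta else (delta / norm) * x


rng = np.random.default_rng(0)
max_inner = -np.inf   # largest value of <x - P_C(x), y - P_C(x)> over the trials
max_ratio = 0.0       # largest value of ||P_C(x) - P_C(z)|| / ||x - z||

for _ in range(1000):
    x = 3.0 * rng.normal(size=3)
    z = 3.0 * rng.normal(size=3)
    y = project_ball(rng.normal(size=3))      # an arbitrary point of C
    px, pz = project_ball(x), project_ball(z)
    max_inner = max(max_inner, (x - px) @ (y - px))
    max_ratio = max(max_ratio, np.linalg.norm(px - pz) / np.linalg.norm(x - z))
```

In agreement with (8.17) and (8.24), `max_inner` never exceeds zero and `max_ratio` never exceeds one, up to rounding errors.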

Definition 8.6. Let $C$ be a closed convex set in a Hilbert space $\mathbb{H}$. An operator

$$T_C : \mathbb{H} \longmapsto \mathbb{H}$$

is called relaxed projection if

$$T_C := I + \mu(P_C - I), \quad \mu \in (0, 2),$$
FIGURE 8.13

Geometric illustration of the relaxed projection operator. 


or in other words, $\forall x \in \mathbb{H}$,

$$T_C(x) = x + \mu\left(P_C(x) - x\right), \quad \mu \in (0, 2) : \quad \text{relaxed projection on } C.$$

We readily see that for $\mu = 1$, $T_C(x) = P_C(x)$. Fig. 8.13 shows the geometric illustration of the
relaxed projection. Observe that for different values of $\mu \in (0, 2)$, the relaxed projection traces all
points in the line segment joining the points $x$ and $x + 2(P_C(x) - x)$. Note that

$$T_C(x) = x, \quad \forall x \in C,$$

that is, $\mathrm{Fix}(T_C) = C$. Moreover, it can be shown that the relaxed projection operator is also
nonexpansive, that is (Problem 8.19),

$$\|T_C(x) - T_C(y)\| \leq \|x - y\|, \quad \forall \mu \in (0, 2).$$


A final property of the relaxed projection, which can also be easily shown (Problem 8.20), is the
following:

$$\forall y \in C, \quad \|T_C(x) - y\|^2 \leq \|x - y\|^2 - \eta\|T_C(x) - x\|^2, \quad \eta = \frac{2 - \mu}{\mu}. \tag{8.25}$$

Such mappings are known as $\eta$-nonexpansive or strongly attracting mappings; it is guaranteed that the
squared distance $\|T_C(x) - y\|^2$ is smaller than $\|x - y\|^2$ at least by the positive quantity $\eta\|T_C(x) - x\|^2$; that is,
the fixed point set $\mathrm{Fix}(T_C) = C$ strongly attracts $x$. The geometric interpretation is given in Fig. 8.14.
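The relaxed projection and its strongly attracting property can be illustrated with a small numerical experiment. The following is a minimal sketch, assuming NumPy and taking the unit ball in $\mathbb{R}^3$ as $C$; `relaxed_projection` and `project_ball` are hypothetical helper names, and the quantity `gap` is the right-hand side minus the left-hand side of (8.25), which should be nonnegative.

```python
import numpy as np


def project_ball(x, delta=1.0):
    """Projection onto the closed ball B[0, delta], Eq. (8.14)."""
    norm = np.linalg.norm(x)
    return x.copy() if norm <= delta else (delta / norm) * x


def relaxed_projection(x, project, mu):
    """T_C(x) = x + mu * (P_C(x) - x), with mu in (0, 2)."""
    if not 0.0 < mu < 2.0:
        raise ValueError("mu must lie in the open interval (0, 2)")
    return x + mu * (project(x) - x)


rng = np.random.default_rng(1)
worst_gap = np.inf
for _ in range(1000):
    mu = rng.uniform(0.05, 1.95)
    eta = (2.0 - mu) / mu
    x = 3.0 * rng.normal(size=3)
    y = project_ball(rng.normal(size=3))          # an arbitrary point of C
    t = relaxed_projection(x, project_ball, mu)
    # Right-hand side minus left-hand side of Eq. (8.25).
    gap = np.sum((x - y) ** 2) - eta * np.sum((t - x) ** 2) - np.sum((t - y) ** 2)
    worst_gap = min(worst_gap, gap)

# mu = 1 recovers the plain projection.
x_test = np.array([2.0, 0.0, 0.0])
same_as_projection = np.allclose(relaxed_projection(x_test, project_ball, 1.0),
                                 project_ball(x_test))
```

For $\mu < 1$ the point stops short of $P_C(x)$ (under-relaxation), whereas for $\mu > 1$ it overshoots it (over-relaxation); in both cases the inequality holds in the experiment.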


8.4 FUNDAMENTAL THEOREM OF PROJECTIONS ONTO CONVEX SETS 

In this section, one of the most celebrated theorems in the theory of convex sets is stated: the funda- 
mental theorem of projections onto convex sets (POCS). This theorem is at the heart of a number of 
powerful algorithms and methods, some of which are described in this book. The origin of the theorem 
is traced back to Von Neumann [98], who proposed the theorem for the case of two subspaces. 



FIGURE 8.14 

The relaxed projection is a strongly attracting mapping; $T_C(x)$ is closer to any point $y \in C = \mathrm{Fix}(T_C)$ than the point
$x$ is.


Von Neumann was a Hungarian-born American of Jewish descent. He was a child prodigy who
earned his PhD at age 22. It is difficult to summarize his numerous significant contributions, which
range from pure mathematics to economics (he is considered the founder of game theory) and from
quantum mechanics (he laid the foundations of the mathematical framework in quantum mechanics) to
computer science (he was involved in the development of ENIAC, the first general purpose electronic
computer). He was heavily involved in the Manhattan Project for the development of the atomic
bomb.

Let $C_k$, $k = 1, 2, \ldots, K$, be a finite number of closed convex sets in a Hilbert space $\mathbb{H}$, and assume
that they share a nonempty intersection,

$$C = \bigcap_{k=1}^{K} C_k \neq \emptyset.$$

Let $T_{C_k}$, $k = 1, 2, \ldots, K$, be the respective relaxed projection mappings

$$T_{C_k} = I + \mu_k(P_{C_k} - I), \quad \mu_k \in (0, 2), \quad k = 1, 2, \ldots, K.$$

Form the concatenation of these relaxed projections,

$$T := T_{C_K} T_{C_{K-1}} \cdots T_{C_1},$$

where the specific order is not important. In words, $T$ comprises a sequence of relaxed projections,
starting from $C_1$. In the sequel, the obtained point is projected onto $C_2$, and so on.


Theorem 8.5. Let $C_k$, $k = 1, 2, \ldots, K$, be closed convex sets in a Hilbert space $\mathbb{H}$, with nonempty
intersection. Then for any $x_0 \in \mathbb{H}$, the sequence $(T^n(x_0))$, $n = 1, 2, \ldots$, converges weakly to a point
in $C = \bigcap_{k=1}^{K} C_k$.
The theorem [17,42] involves the notion of weak convergence. When $\mathbb{H}$ becomes a Euclidean
(finite-dimensional) space, the notion of weak convergence coincides with the familiar "standard" definition
of (strong) convergence. Weak convergence is a weaker version of strong convergence, and it is met
in infinite-dimensional spaces. A sequence $x_n \in \mathbb{H}$ is said to converge weakly to a point $x_* \in \mathbb{H}$ if,
$\forall y \in \mathbb{H}$,

$$\langle x_n, y\rangle \xrightarrow[n \to \infty]{} \langle x_*, y\rangle,$$

and we write

$$x_n \xrightarrow[n \to \infty]{w} x_*.$$

As already said, in Euclidean spaces, weak convergence implies strong convergence. This is not
necessarily true for general Hilbert spaces. On the other hand, strong convergence always implies weak
convergence (for example, [87]) (Problem 8.21). Fig. 8.15 gives the geometric illustration of the theorem.

The proof of the theorem is a bit technical for the general case (for example, [87]). However, it
can be simplified for the case where the involved convex sets are closed subspaces (Problem 8.23). At
the heart of the proof lie (a) the nonexpansiveness property of $T_{C_k}$, $k = 1, 2, \ldots, K$, which is retained
by $T$, and (b) the fact that the fixed point set of $T$ is $\mathrm{Fix}(T) = \bigcap_{k=1}^{K} C_k$.

Remarks 8.3. 

• In the special case where all $C_k$, $k = 1, 2, \ldots, K$, are closed subspaces, we have

$$T^n(x_0) \longrightarrow P_C(x_0).$$

In other words, the sequence of relaxed projections converges strongly to the projection of $x_0$ on $C$.
Recall that if each $C_k$, $k = 1, 2, \ldots, K$, is a closed subspace of $\mathbb{H}$, it can easily be shown that their



FIGURE 8.15 

Geometric illustration of the fundamental theorem of projections onto convex sets (POCS), for $T_{C_i} = P_{C_i}$,
$i = 1, 2$ ($\mu_{C_i} = 1$). The closed convex sets are the two straight lines in $\mathbb{R}^2$. Observe that the sequence of
projections tends to the intersection of $H_1$, $H_2$.
intersection is also a closed subspace. As said before, in a Euclidean space $\mathbb{R}^l$, all subspaces are
closed.

• The previous statement is also true for linear varieties. A linear variety is the translation of a
subspace by a constant vector $a$. That is, if $S$ is a subspace and $a \in \mathbb{H}$, then the set of points

$$S_a = \{y : y = a + x, \; x \in S\}$$

is a linear variety. Hyperplanes are linear varieties (see, for example, Fig. 8.16).

• The scheme resulting from the POCS theorem, employing the relaxed projection operator, is
summarized in Algorithm 8.1.

Algorithm 8.1 (The POCS algorithm).

- Initialization.

* Select $x_0 \in \mathbb{H}$.

* Select $\mu_k \in (0, 2)$, $k = 1, 2, \ldots, K$.

- For $n = 1, 2, \ldots$, Do

* $\hat{x}_{0,n} = x_{n-1}$

* For $k = 1, 2, \ldots, K$, Do

$$\hat{x}_{k,n} = \hat{x}_{k-1,n} + \mu_k\left(P_{C_k}(\hat{x}_{k-1,n}) - \hat{x}_{k-1,n}\right) \tag{8.26}$$

* End For

* $x_n = \hat{x}_{K,n}$.

- End For
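The cyclic scheme of Algorithm 8.1 can be sketched in a few lines. The example below assumes NumPy and a Euclidean space, and it uses two intersecting lines in $\mathbb{R}^2$ as the closed convex sets; the function names `pocs` and `make_hyperplane_projection`, the fixed number of sweeps `n_iter`, and the particular lines are assumptions made for illustration.

```python
import numpy as np


def make_hyperplane_projection(theta, theta0):
    """Return P_C for the hyperplane {y : <theta, y> + theta0 = 0}, Eq. (8.11)."""
    theta = np.asarray(theta, dtype=float)

    def project(x):
        return x - (theta @ x + theta0) / (theta @ theta) * theta

    return project


def pocs(x0, projections, mus, n_iter=200):
    """Cyclic relaxed projections, Eq. (8.26), repeated for n_iter sweeps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        for project, mu in zip(projections, mus):
            x = x + mu * (project(x) - x)
    return x


# Two lines in R^2: x1 + x2 = 2 and x2 = 1. They intersect at (1, 1).
P1 = make_hyperplane_projection([1.0, 1.0], -2.0)
P2 = make_hyperplane_projection([0.0, 1.0], -1.0)

x_plain = pocs([5.0, -3.0], [P1, P2], mus=[1.0, 1.0])      # plain projections
x_relaxed = pocs([5.0, -3.0], [P1, P2], mus=[1.2, 0.8])    # relaxed projections
```

In this toy setting both runs end at the intersection point $(1, 1)$. In practice one would stop on a tolerance rather than after a fixed number of sweeps; the fixed count is kept here only for simplicity.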



FIGURE 8.16 

A hyperplane (not crossing the origin) is a linear variety; $P_{S_a}$ and $P_S$ are the projections of $x$ onto $S_a$ and $S$,
respectively.
8.5 A PARALLEL VERSION OF POCS

In [68], a parallel version of the POCS algorithm was stated. In addition to its computational advantages
(when parallel processing can be exploited), this scheme will be our vehicle for generalizations to
online processing, where one can cope with the case where the number of convex sets becomes infinite
(or very large in practice). The proof for the parallel POCS is slightly technical and relies heavily on the
results stated in the previous section. The concept behind the proof is to construct appropriate product
spaces, and this is the reason that the algorithm is also referred to as POCS in product spaces. For the
detailed proof, the interested reader may consult [68].

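Theorem 8.6. Let $C_k$, $k = 1, 2, \ldots, K$, be closed convex sets in a Hilbert space $\mathbb{H}$. Then, for any
$x_0 \in \mathbb{H}$, the sequence $x_n$, defined as

$$x_n = x_{n-1} + \mu_n\left(\sum_{k=1}^{K} \omega_k P_{C_k}(x_{n-1}) - x_{n-1}\right), \tag{8.27}$$

weakly converges to a point in $\bigcap_{k=1}^{K} C_k$, if

$$0 < \mu_n \leq M_n,$$

and

$$M_n := \frac{\sum_{k=1}^{K} \omega_k \left\|P_{C_k}(x_{n-1}) - x_{n-1}\right\|^2}{\left\|\sum_{k=1}^{K} \omega_k P_{C_k}(x_{n-1}) - x_{n-1}\right\|^2}, \tag{8.28}$$

where $\omega_k > 0$, $k = 1, 2, \ldots, K$, such that

$$\sum_{k=1}^{K} \omega_k = 1.$$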

Update recursion (8.27) says that at each iteration, all projections on the convex sets take place
concurrently, and then they are convexly combined. The extrapolation parameter $\mu_n$ is chosen in the
interval $(0, M_n]$, where $M_n$ is recursively computed in (8.28), so that convergence is guaranteed. Fig. 8.17
illustrates the updating process.
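The parallel recursion (8.27) together with the bound (8.28) can be sketched as follows. This is an illustrative implementation under stated assumptions: NumPy, a Euclidean space, two lines in $\mathbb{R}^2$ as the convex sets, and a hypothetical parameter `step_fraction` in $(0, 1]$ that sets $\mu_n$ equal to `step_fraction` times $M_n$, so that $\mu_n$ stays in $(0, M_n]$.

```python
import numpy as np


def make_hyperplane_projection(theta, theta0):
    """Return P_C for the hyperplane {y : <theta, y> + theta0 = 0}, Eq. (8.11)."""
    theta = np.asarray(theta, dtype=float)

    def project(x):
        return x - (theta @ x + theta0) / (theta @ theta) * theta

    return project


def parallel_pocs(x0, projections, weights, step_fraction=1.0, n_iter=200):
    """Iterate Eq. (8.27) with mu_n = step_fraction * M_n, M_n as in Eq. (8.28)."""
    if not 0.0 < step_fraction <= 1.0:
        raise ValueError("step_fraction must lie in (0, 1]")
    x = np.asarray(x0, dtype=float)
    w = np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        proj = np.array([P(x) for P in projections])   # the K projections (parallelizable)
        direction = w @ proj - x                       # convex combination minus x
        denom = direction @ direction
        if denom < 1e-30:                              # numerically at a fixed point
            break
        M_n = (w @ np.sum((proj - x) ** 2, axis=1)) / denom
        x = x + step_fraction * M_n * direction
    return x


# Two lines in R^2: x1 + x2 = 2 and x2 = 1. They intersect at (1, 1).
P1 = make_hyperplane_projection([1.0, 1.0], -2.0)
P2 = make_hyperplane_projection([0.0, 1.0], -1.0)
x_hat = parallel_pocs([5.0, -3.0], [P1, P2], weights=[0.5, 0.5])
```

By the convexity of the squared norm, $M_n \geq 1$, so the extrapolated step is never shorter than the plain convex combination obtained with $\mu_n = 1$.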


8.6 FROM CONVEX SETS TO PARAMETER ESTIMATION AND MACHINE 
LEARNING 

Let us now see how this elegant theory can be turned into a useful tool for parameter estimation in 
machine learning. We will demonstrate the procedure using two examples. 


8.6.1 REGRESSION 

Consider the regression model, relating input-output observation points, 
FIGURE 8.17

The parallel POCS algorithm for the case of two (hyperplanes) lines in $\mathbb{R}^2$. At each step, the projections on $H_1$ and
$H_2$ are carried out in parallel and then they are convexly combined.



FIGURE 8.18

Each pair of training points, $(y_n, x_n)$, defines a hyperslab in the parameters' space.


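$$y_n = \theta_o^T x_n + \eta_n, \quad (y_n, x_n) \in \mathbb{R} \times \mathbb{R}^l, \quad n = 1, 2, \ldots, N, \tag{8.29}$$

where $\theta_o$ is the unknown parameter vector. Assume that $\eta_n$ is a bounded noise sequence, that is,

$$|\eta_n| \leq \epsilon. \tag{8.30}$$

Then (8.29) and (8.30) guarantee that

$$|y_n - x_n^T \theta_o| \leq \epsilon. \tag{8.31}$$

Consider now the following set of points:

$$S_\epsilon = \{\theta : |y_n - x_n^T \theta| \leq \epsilon\} : \quad \text{hyperslab}. \tag{8.32}$$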


This set is known as a hyperslab, and it is geometrically illustrated in Fig. 8.18. The definition is
generalized for any $\mathbb{H}$ by replacing the inner product notation as $\langle x_n, \theta\rangle$. The set comprises all the
points that lie in the region formed by the two hyperplanes

$$x_n^T\theta - y_n = \epsilon,$$

$$x_n^T\theta - y_n = -\epsilon.$$

This region is trivially shown to be a closed convex set. Note that every pair of training points,
$(y_n, x_n)$, $n = 1, 2, \ldots, N$, defines a hyperslab of different orientation (depending on $x_n$) and
position in space (determined by $y_n$). Moreover, (8.31) guarantees that the unknown, $\theta_o$, lies within all
these hyperslabs; hence, $\theta_o$ lies in their intersection. All we need to do now is derive the projection
operator onto hyperslabs (we will do it soon) and use one of the POCS schemes to find a point in the
intersection. Assuming that enough training points are available and that the intersection is "small"
enough, any point in this intersection will be "close" to $\theta_o$. Note that such a procedure is not based
on optimization arguments. Recall, however, that even in optimization techniques, iterative algorithms
have to be used, and in practice, iterations have to stop after a finite number of steps. Thus, one can
only approximately reach the optimal value. More on these issues and related convergence properties
will be discussed later in this chapter.

The obvious question now is what happens if the noise is not bounded. There are two answers to
this point. First, in any practical application where measurements are involved, the noise has to be
bounded. Otherwise, the circuits will be burned out. So, at least conceptually, this assumption does
not conflict with what happens in practice. It is a matter of selecting the right value for $\epsilon$. The second
answer is that one can choose $\epsilon$ to be a few times the standard deviation of the assumed noise model.
Then $\theta_o$ will lie in these hyperslabs with high probability. We will discuss strategies for selecting $\epsilon$,
but our goal in this section is to discuss the main rationale in using the theory in practical applications.
Needless to say, there is nothing divine around hyperslabs. Other closed convex sets can also be used
if the nature of the noise in a specific application suggests a different type of convex sets.



FIGURE 8.19 

The linear $\epsilon$-insensitive loss function, $\tau = y - \theta^T x$. Its value is zero if $|\tau| < \epsilon$, and it increases linearly for $|\tau| \geq \epsilon$.

It is now interesting to look at the set where the solution lies, in this case at the hyperslab, from a
different perspective. Consider the loss function

$$\mathcal{L}(y, \theta^T x) = \max\left(0, |y - x^T\theta| - \epsilon\right) : \quad \text{linear } \epsilon\text{-insensitive loss function}, \tag{8.33}$$
which is illustrated in Fig. 8.19 for the case $\theta \in \mathbb{R}$. This is known as the linear $\epsilon$-insensitive loss function,
and it has been popularized in the context of support vector regression (Chapter 11). For all $\theta$s which lie
within the hyperslab defined in (8.32), the loss function scores a zero. For points outside the hyperslab,
there is a linear increase of its value. Thus, the hyperslab is the zero level set of the linear $\epsilon$-insensitive
loss function, defined locally according to the point $(y_n, x_n)$. Thus, although no optimization concept
is associated with POCS, the choice of the closed convex sets can be made to minimize "locally," at
each point, a convex loss function by selecting its zero level set.

We conclude our discussion by providing the projection operator of a hyperslab, $S_\epsilon$. It is trivially
shown that, given $\theta$, its projection onto $S_\epsilon$ (defined by $(y_n, x_n, \epsilon)$) is given by

$$P_{S_\epsilon}(\theta) = \theta + \beta_\theta(y_n, x_n)\,x_n, \tag{8.34}$$

where

$$\beta_\theta(y_n, x_n) = \begin{cases} \dfrac{y_n - \langle x_n, \theta\rangle - \epsilon}{\|x_n\|^2}, & \text{if } \langle x_n, \theta\rangle - y_n < -\epsilon, \\ 0, & \text{if } |\langle x_n, \theta\rangle - y_n| \leq \epsilon, \\ \dfrac{y_n - \langle x_n, \theta\rangle + \epsilon}{\|x_n\|^2}, & \text{if } \langle x_n, \theta\rangle - y_n > \epsilon. \end{cases} \tag{8.35}$$

That is, if the point lies within the hyperslab, it coincides with its projection. Otherwise, the projection
is on one of the two hyperplanes (depending on which side of the hyperslab the point lies on), which
define $S_\epsilon$. Recall that the projection of a point outside a closed convex set lies on the boundary of
that set.
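The projection onto a hyperslab, (8.34)-(8.35), translates directly into code. The sketch below assumes NumPy and a Euclidean parameter space; the function name `project_hyperslab` and the numerical values are hypothetical and serve only to exercise the three branches.

```python
import numpy as np


def project_hyperslab(theta, x_n, y_n, eps):
    """Projection of theta onto {theta : |y_n - x_n^T theta| <= eps}, Eqs. (8.34)-(8.35)."""
    residual = x_n @ theta - y_n
    if residual < -eps:
        beta = (y_n - x_n @ theta - eps) / (x_n @ x_n)
    elif residual > eps:
        beta = (y_n - x_n @ theta + eps) / (x_n @ x_n)
    else:
        beta = 0.0          # theta already lies inside the hyperslab
    return theta + beta * x_n


x_n = np.array([1.0, 2.0])
y_n, eps = 1.0, 0.5

above = project_hyperslab(np.array([3.0, 3.0]), x_n, y_n, eps)   # residual > eps
below = project_hyperslab(np.array([0.0, 0.0]), x_n, y_n, eps)   # residual < -eps
inside = project_hyperslab(np.array([1.0, 0.0]), x_n, y_n, eps)  # |residual| <= eps
```

A point outside the hyperslab lands on the nearer of the two bounding hyperplanes, so that $x_n^T\theta - y_n$ equals $+\epsilon$ or $-\epsilon$ after the projection, while a point inside is left untouched.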



FIGURE 8.20 


Each training point $(y_n, x_n)$ defines a halfspace in the parameters $\theta$-space, and the linear classifier will be searched
in the intersection of all these halfspaces.
8.6.2 CLASSIFICATION

Let us consider the two-class classification task, and assume that we are given the set of training points
$(y_n, x_n)$, $n = 1, 2, \ldots, N$.

Our goal now will be to design a linear classifier so as to score

$$\theta^T x_n \geq \rho, \quad \text{if } y_n = +1,$$

and

$$\theta^T x_n \leq -\rho, \quad \text{if } y_n = -1.$$

This requirement can be expressed as follows. Given $(y_n, x_n) \in \{-1, 1\} \times \mathbb{R}^{l+1}$, design a linear
classifier,⁵ $\theta \in \mathbb{R}^{l+1}$, such that

$$y_n\theta^T x_n \geq \rho > 0. \tag{8.36}$$

Note that, given $y_n$, $x_n$, and $\rho$, (8.36) defines a halfspace (Example 8.1); this is the reason that we used
"$\geq \rho$" rather than a strict inequality. In other words, all $\theta$s which satisfy the desired inequality (8.36)
lie in this halfspace. Since each pair $(y_n, x_n)$, $n = 1, 2, \ldots, N$, defines a single halfspace, our goal
now becomes that of trying to find a point at the intersection of all these halfspaces. This intersection
is guaranteed to be nonempty if the classes are linearly separable. Fig. 8.20 illustrates the concept. The
more realistic case of nonlinearly separable classes will be treated in Chapter 11, where a mapping in a
high-dimensional (kernel) space makes the probability of two classes being linearly separable tend
to 1 as the dimensionality of the kernel space goes to infinity.

The halfspace associated with a training pair $(y_n, x_n)$ can be seen as the level set of height zero of
the so-called hinge loss function, defined as

$$\mathcal{L}_\rho(y, \theta^T x) = \max\left(0, \rho - y\theta^T x\right) : \quad \text{hinge loss function}, \tag{8.37}$$

whose graph is shown in Fig. 8.21. Thus, choosing the halfspace as the closed convex set to represent
$(y_n, x_n)$ is equivalent to selecting the zero level set of the hinge loss, "adjusted" for the point $(y_n, x_n)$.
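The link between the halfspace (8.36) and the hinge loss (8.37) can be made concrete with a small sketch. It assumes NumPy and obtains the projection in the $\theta$-space from the halfspace formula (8.13) by taking $y_n x_n$ as the normal vector and $-\rho$ as the offset; the function names `hinge_loss` and `project_margin_halfspace` and the numerical values are hypothetical.

```python
import numpy as np


def hinge_loss(theta, x_n, y_n, rho):
    """Hinge loss of Eq. (8.37) evaluated at the pair (y_n, x_n)."""
    return max(0.0, rho - y_n * (theta @ x_n))


def project_margin_halfspace(theta, x_n, y_n, rho):
    """Projection of theta onto {theta : y_n * theta^T x_n >= rho}.

    This is Eq. (8.13) with normal vector a = y_n * x_n and offset -rho.
    """
    a = y_n * x_n
    return theta - min(0.0, a @ theta - rho) / (a @ a) * a


x_n = np.array([1.0, 2.0, 1.0])    # last element equal to 1 accounts for the bias term
y_n, rho = -1.0, 1.0
theta = np.array([1.0, 1.0, 0.0])  # misclassifies the point: theta^T x_n = 3 > 0

loss_before = hinge_loss(theta, x_n, y_n, rho)
theta_proj = project_margin_halfspace(theta, x_n, y_n, rho)
loss_after = hinge_loss(theta_proj, x_n, y_n, rho)
margin_after = y_n * (theta_proj @ x_n)
```

After the projection the margin constraint is met with equality and the hinge loss drops to zero (up to rounding), which is the sense in which the halfspace is the zero level set of the loss.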

Remarks 8.4. 

• In addition to the two applications typical of the machine learning point of view, POCS has been 
applied in a number of other applications; see, for example, [7,24,87,89] for further reading. 

• If the involved sets do not intersect, that is, $\bigcap_{k=1}^{K} C_k = \emptyset$, then it has been shown [25] that the
parallel version of POCS in (8.27) converges to a point whose weighted squared distance from
each one of the convex sets (defined as the distance of the point from its respective projection) is
minimized.

• Attempts to generalize the theory to nonconvex sets have also been made (for example, [87] and 
more recently in the context of sparse modeling in [83]). 


⁵ Recall from Chapter 3 that this formulation covers the general case where a bias term is involved, by increasing the
dimensionality of $x_n$ and adding 1 as its last element.
FIGURE 8.21

The hinge loss function. For the classification task $\tau = y\theta^T x$, its value is zero if $\tau \geq \rho$, and it increases linearly for
$\tau < \rho$.

• When $C := \bigcap_{k=1}^{K} C_k \neq \emptyset$, we say that the problem is feasible and the intersection $C$ is known
as the feasibility set. The closed convex sets $C_k$, $k = 1, 2, \ldots, K$, are sometimes called the
property sets, for obvious reasons. In both previous examples, namely, regression and classification, we
commented that the involved property sets resulted as the 0-level sets of a loss function $\mathcal{L}$. Hence,
assuming that the problem is feasible (the cases of bounded noise in regression and linearly
separable classes in classification), any solution in the feasible set $C$ will also be a minimizer of the
respective loss functions in (8.33) and (8.37), respectively. Thus, although optimization did not
enter into our discussion, there can be an optimizing flavor in the POCS method. Moreover, note that
in this case, the loss functions need not be differentiable and the techniques we discussed in the
previous chapters are not applicable. We will return to this issue in Section 8.10.


8.7 INFINITELY MANY CLOSED CONVEX SETS: THE ONLINE LEARNING 
CASE 

In our discussion so far, we have assumed a finite number, $K$, of closed convex (property) sets. To
land at their intersection (feasibility set) one has to cyclically project onto all of them or to perform
the projections in parallel. Such a strategy is not appealing for the online processing scenario. At every
time instant, a new pair of observations becomes available, which defines a new property set. Hence,
in this case, the number of the available convex sets increases. Visiting all the available sets makes
the complexity time dependent and after some time the required computational resources will become
unmanageable.

An alternative viewpoint was suggested in [101-103], and later on extended in [76,104,105]. The
main idea here is that at each time instant $n$, a pair of output-input training data is received and a
(property) closed convex set, $C_n$, is constructed. The time index, $n$, is left to grow unbounded.
However, at each time instant, only the $q$ (a user-defined parameter) most recently constructed property sets are
considered. In other words, the parameter $q$ defines a sliding window in time. At each time instant,
projections/relaxed projections are performed within this time window. The rationale is illustrated in
FIGURE 8.22

At time $n$, the property sets $C_{n-q+1}, \ldots, C_n$ are used, while at time $n + 1$, the sets $C_{n-q+2}, \ldots, C_{n+1}$ are
considered. Thus, the required number of projections does not grow with time.

Fig. 8.22. Thus, the number of sets onto which projections are performed does not grow with time,
their number remains finite, and it is fixed by the user. The developed algorithm is an offspring of the
parallel version of POCS and it is known as the adaptive projected subgradient method (APSM). We will
describe the algorithm in the context of regression. Following the discussion in Section 8.6, as each
pair $(y_n, x_n) \in \mathbb{R} \times \mathbb{R}^l$ becomes available, a hyperslab $S_{\epsilon,n}$, $n = 1, 2, \ldots$, is constructed and the goal is
to find a $\theta \in \mathbb{R}^l$ that lies in the intersection of all these property sets, starting from an arbitrary value,
$\theta_0 \in \mathbb{R}^l$.

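Algorithm 8.2 (The APSM algorithm).

• Initialization

- Choose $\theta_0 \in \mathbb{R}^l$.

- Choose $q$; the number of property sets to be processed at each time instant.

• For $n = 1, 2, \ldots, q - 1$, Do; initial period, that is, $n < q$.

- Choose $\omega_1, \ldots, \omega_n$: $\sum_{k=1}^{n} \omega_k = 1$, $\omega_k \geq 0$

- Select $\mu_n$

$$\theta_n = \theta_{n-1} + \mu_n\left(\sum_{k=1}^{n} \omega_k P_{S_{\epsilon,k}}(\theta_{n-1}) - \theta_{n-1}\right) \tag{8.38}$$

• End For

• For $n = q, q + 1, \ldots$, Do

- Choose $\omega_n, \ldots, \omega_{n-q+1}$; usually $\omega_k = \frac{1}{q}$, $k = n - q + 1, \ldots, n$

- Select $\mu_n$

$$\theta_n = \theta_{n-1} + \mu_n\left(\sum_{k=n-q+1}^{n} \omega_k P_{S_{\epsilon,k}}(\theta_{n-1}) - \theta_{n-1}\right) \tag{8.39}$$

• End For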


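The extrapolation parameter can now be chosen in the interval $(0, 2M_n)$ in order for convergence
to be guaranteed. For the case of (8.39),

$$M_n = \frac{\sum_{k=n-q+1}^{n} \omega_k \left\|P_{S_{\epsilon,k}}(\theta_{n-1}) - \theta_{n-1}\right\|^2}{\left\|\sum_{k=n-q+1}^{n} \omega_k P_{S_{\epsilon,k}}(\theta_{n-1}) - \theta_{n-1}\right\|^2}. \tag{8.40}$$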
FIGURE 8.23

At time $n$, $q = 2$ hyperslabs are processed, namely, $S_{\epsilon,n}$ and $S_{\epsilon,n-1}$; $\theta_{n-1}$ is concurrently projected onto both of them
and the projections are convexly combined. The new estimate is $\theta_n$. Next, $S_{\epsilon,n+1}$ "arrives" and the process is
repeated. Note that at every time, the estimate gets closer to the intersection; the latter will become smaller and
smaller, as more hyperslabs arrive.


Note that this interval differs from that reported in the case of a finite number of sets in Eq. (8.27).
For the first iteration steps associated with Eq. (8.38), the summations in the above formula start from
$k = 1$ instead of $k = n - q + 1$.

Recall that $P_{S_{\epsilon,k}}$ is the projection operation given in (8.34)-(8.35). Note that this is a generic scheme
and can be applied with different property sets. All that is needed is to change the projection operator.
For example, if classification is considered, all we have to do is to replace $S_{\epsilon,n}$ by the halfspace $H_n^+$,
defined by the pair $(y_n, x_n) \in \{-1, 1\} \times \mathbb{R}^{l+1}$, as explained in Section 8.6.2, and use the projection
from (8.13) (see [78,79]). At this point, it must be emphasized that the original APSM form (e.g., [103,
104]) is more general and can cover a wide range of convex sets and functions.

Fig. 8.23 illustrates geometrically the APSM algorithm. We have assumed that the number of
hyperslabs that are considered for projection at each time instant is $q = 2$. Each iteration comprises:

• q projections, which can be carried out in parallel, 

• their convex combination, and 

• the update step. 
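The three steps listed above can be put together in a compact sketch of the sliding-window scheme for regression. It is an illustration under stated assumptions rather than a reference implementation: NumPy, uniform weights over the window, synthetic data with uniformly distributed (hence bounded) noise, and a hypothetical parameter `step_fraction` in $(0, 2)$ that sets $\mu_n$ equal to `step_fraction` times $M_n$. The names `apsm` and `project_hyperslab` are likewise assumptions.

```python
import numpy as np


def project_hyperslab(theta, x_n, y_n, eps):
    """Projection of theta onto {theta : |y_n - x_n^T theta| <= eps}, Eqs. (8.34)-(8.35)."""
    residual = x_n @ theta - y_n
    if residual < -eps:
        beta = (y_n - x_n @ theta - eps) / (x_n @ x_n)
    elif residual > eps:
        beta = (y_n - x_n @ theta + eps) / (x_n @ x_n)
    else:
        beta = 0.0
    return theta + beta * x_n


def apsm(X, y, eps, q=4, step_fraction=1.0):
    """Sliding-window APSM sketch with uniform weights and mu_n = step_fraction * M_n."""
    if not 0.0 < step_fraction < 2.0:
        raise ValueError("step_fraction must lie in (0, 2)")
    theta = np.zeros(X.shape[1])
    for n in range(len(y)):
        window = range(max(0, n - q + 1), n + 1)       # at most the q most recent hyperslabs
        proj = np.array([project_hyperslab(theta, X[k], y[k], eps) for k in window])
        direction = proj.mean(axis=0) - theta          # convex combination minus theta
        denom = direction @ direction
        if denom < 1e-30:                              # no effective update direction
            continue
        M_n = np.mean(np.sum((proj - theta) ** 2, axis=1)) / denom
        theta = theta + step_fraction * M_n * direction
    return theta


rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
eps = 0.05
X = rng.normal(size=(2000, 3))
noise = rng.uniform(-eps, eps, size=2000)              # bounded noise, |eta_n| <= eps
y = X @ theta_true + noise

theta_hat = apsm(X, y, eps, q=4, step_fraction=1.0)
error = np.linalg.norm(theta_hat - theta_true)
```

During the first $q - 1$ time instants the window simply contains fewer than $q$ hyperslabs, which mirrors the initial period of the algorithm. Swapping `project_hyperslab` for a halfspace projection would give the classification variant mentioned in the text.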

8.7.1 CONVERGENCE OF APSM 

The proof of the convergence of the APSM is a bit technical and the interested reader can consult the
related references. Here, we can be content with a geometric illustration that intuitively justifies the
convergence, under certain assumptions. This geometric interpretation is at the heart of a stochastic
approach to the APSM convergence, which was presented in [23].

Assume the noise to be bounded and that there is a true $\theta_o$ that generates the data, that is,

$$y_n = x_n^T\theta_o + \eta_n. \tag{8.41}$$

By assumption,

$$|\eta_n| \leq \epsilon.$$

Hence,

$$|x_n^T\theta_o - y_n| \leq \epsilon.$$

Thus, $\theta_o$ does lie in the intersection of all the hyperslabs of the form

$$|x_n^T\theta - y_n| \leq \epsilon,$$

and in this case the problem is feasible. The question that is raised is how close one can go, even
asymptotically as $n \to \infty$, to $\theta_o$. For example, if the volume of the intersection is large, even if the
algorithm converges to a point in the boundary of this intersection this does not necessarily say much
about how close the solution is to the true value $\theta_o$. The proof in [23] establishes that the algorithm
brings the estimate arbitrarily close to $\theta_o$, under some general assumptions concerning the sequence
of observations, and that the noise is bounded.

To understand what is behind the technicalities of the proof, recall that there are two main geometric
issues concerning a hyperslab: (a) its orientation, which is determined by $x_n$, and (b) its width. In
finite-dimensional spaces, it is a matter of simple geometry to show that the width of a hyperslab is
equal to

$$d = \frac{2\epsilon}{\|x_n\|}. \tag{8.42}$$

This is a direct consequence of the fact that the distance⁶ of a point, say, $\bar{\theta}$, from the hyperplane defined
by the pair $(y, x)$, that is,

$$x^T\theta - y = 0,$$

is equal to

$$\frac{|x^T\bar{\theta} - y|}{\|x\|}.$$

Indeed, let $\bar{\theta}$ be a point on one of the two boundary hyperplanes (e.g., $x_n^T\theta - y_n = \epsilon$) which define
the hyperslab and consider its distance from the other one ($x_n^T\theta - y_n = -\epsilon$); then (8.42) is readily
obtained.
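The step left to the reader can be written out explicitly. The short derivation below is a sketch that uses only the point-to-hyperplane distance formula quoted in the text; the symbol $\bar{\theta}$ for the chosen boundary point is a notational assumption.

```latex
x_n^T\bar{\theta} - y_n = \epsilon
\;\Longrightarrow\;
d = \frac{\left|x_n^T\bar{\theta} - (y_n - \epsilon)\right|}{\|x_n\|}
  = \frac{\left|(y_n + \epsilon) - (y_n - \epsilon)\right|}{\|x_n\|}
  = \frac{2\epsilon}{\|x_n\|}.
```

Here the second hyperplane, $x_n^T\theta - y_n = -\epsilon$, has been rewritten as $x_n^T\theta - (y_n - \epsilon) = 0$ so that the distance formula applies with $y$ replaced by $y_n - \epsilon$.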

Fig. 8.24 shows four hyperslabs in two different directions (one for the full lines and one for the
dotted lines). The red hyperslabs are narrower than the black ones. Moreover, all four necessarily
include $\theta_o$. If $x_n$ is left to vary randomly so that any orientation will occur with high probability and,
for any orientation, the norm can also take small as well as arbitrarily large values, then intuition says
that the volume of the intersection around $\theta_o$ will become arbitrarily small. Further results concerning
the APSM algorithm can be found in, e.g., [82,89,103,104].


$^{6}$ For Euclidean spaces, this can be easily established by simple geometric arguments; see also Section 11.10.1.

FIGURE 8.24

For each direction, the width of a hyperslab varies inversely proportionally to $\|\boldsymbol{x}_n\|$. In this figure, $\|\boldsymbol{x}_n\| < \|\boldsymbol{x}_m\|$,
although both vectors point in the same direction. The intersection of hyperslabs of different directions and widths
renders the volume of their intersection arbitrarily small around $\boldsymbol{\theta}_o$.


Some Practical Hints 

The APSM algorithm needs the setting of three parameters, namely, $\epsilon$, $\mu_n$, and $q$. It turns out that the
algorithm is not particularly sensitive to their choice:

• The choice of the parameter $\mu_n$ is similar in concept to the choice of the step size in the LMS
algorithm. In particular, the larger $\mu_n$, the faster the convergence speed, at the expense of a higher
steady-state error floor. In practice, a step size approximately equal to $0.5 M_n$ will lead to a low
steady-state error, although the convergence speed will be relatively slow. On the contrary, if one
chooses a larger step size, $1.5 M_n$ approximately, then the algorithm enjoys a faster convergence
speed, although the steady-state error after convergence is increased.

• Regarding the parameter $\epsilon$, a typical choice is to set $\epsilon \approx \sqrt{2}\sigma$, where $\sigma$ is the standard deviation
of the noise. In practice (see, e.g., [47]), it has been shown that the algorithm is rather insensitive to
this parameter. Hence, one needs only a rough estimate of the standard deviation.

• Concerning the choice of $q$, this is analogous to the $q$ used for the APA in Chapter 5. The larger
$q$ is, the faster the convergence becomes; however, large values of $q$ increase complexity as well
as the error floor after convergence. In practice, relatively small values of $q$, for example, a small
fraction of $l$, can significantly improve the convergence speed compared to the normalized
least-mean-squares (NLMS) algorithm. Sometimes one can start with a relatively large value of $q$, and
once the error decreases, $q$ can be given smaller values to achieve lower error floors.

It is important to note that the past data reuse within the sliding window of length $q$ in the APA
is implemented via the inversion of a $q \times q$ matrix. In the APSM, this is achieved via a sequence of
$q$ projections, leading to a complexity that depends linearly on $q$; moreover, these projections can
be performed in parallel. Furthermore, the APA tends to be more sensitive to the presence of noise,
since the projections are carried out on hyperplanes. In contrast, for the APSM case, projections are
performed on hyperslabs, which implicitly account for the noise (for example, [102]).
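To make the roles of $\epsilon$, $\mu_n$, and $q$ concrete, the following is a minimal sketch of one APSM-style update built from hyperslab projections. It is an illustration under stated assumptions, not the book's implementation: uniform weights $\omega_k = 1/q$ are assumed, the step size is set to `mu_scale * M_n` with $M_n$ computed as the ratio used for the step-size interval, and all names (`project_hyperslab`, `apsm_step`, `mu_scale`) as well as the synthetic data are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project_hyperslab(theta, x, y, eps):
    """Euclidean projection of theta onto {t : |x^T t - y| <= eps} (x nonzero)."""
    r = dot(x, theta) - y
    if abs(r) <= eps:
        return list(theta)                      # already inside the hyperslab
    shift = r - eps if r > eps else r + eps     # signed excess over the nearest boundary
    nx2 = dot(x, x)
    return [t - shift * xi / nx2 for t, xi in zip(theta, x)]

def apsm_step(theta, window, eps, mu_scale=0.5):
    """One update from the pairs (x, y) in `window`, using mu_n = mu_scale * M_n."""
    w = 1.0 / len(window)                       # uniform convex weights (an assumption)
    diffs = []
    for x, y in window:
        p = project_hyperslab(theta, x, y, eps)
        diffs.append([pi - ti for pi, ti in zip(p, theta)])
    direction = [w * sum(d[i] for d in diffs) for i in range(len(theta))]
    den = dot(direction, direction)
    if den == 0.0:
        return list(theta)                      # theta lies in every hyperslab of the window
    M = sum(w * dot(d, d) for d in diffs) / den
    mu = mu_scale * M
    return [t + mu * di for t, di in zip(theta, direction)]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Synthetic data (assumption): bounded noise with |eta_n| <= 0.05 < eps.
random.seed(0)
theta_o = [1.0, -2.0, 0.5]
eps, q = 0.1, 4
theta, data, dists = [0.0, 0.0, 0.0], [], []
for n in range(200):
    x = [random.gauss(0.0, 1.0) for _ in theta_o]
    y = dot(x, theta_o) + random.uniform(-0.05, 0.05)
    data.append((x, y))
    dists.append(distance(theta, theta_o))
    theta = apsm_step(theta, data[-q:], eps)    # reuse the q most recent pairs
dists.append(distance(theta, theta_o))
print(dists[0], dists[-1])
```

Because the true parameter vector lies in every hyperslab here, the distance of the estimate from it never increases from one step to the next. The $q$ projections inside `apsm_step` are independent of each other, which is why they can be computed in parallel.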

Remarks 8.5. 

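• If the hyperslabs collapse to hyperplanes ($\epsilon = 0$) and $q = 1$, the algorithm becomes the NLMS.
Indeed, for this case the projection in (8.39) becomes the projection on the hyperplane, $H$, defined
by $(y_n, \boldsymbol{x}_n)$, that is,

$$\boldsymbol{x}_n^T \boldsymbol{\theta} = y_n,$$

and from (8.11), after making the appropriate notational adjustments, we have

$$P_H(\boldsymbol{\theta}_{n-1}) = \boldsymbol{\theta}_{n-1} - \frac{\boldsymbol{x}_n^T \boldsymbol{\theta}_{n-1} - y_n}{\|\boldsymbol{x}_n\|^2}\,\boldsymbol{x}_n. \tag{8.43}$$

Plugging (8.43) into (8.39), we get

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \mu_n \frac{e_n}{\|\boldsymbol{x}_n\|^2}\,\boldsymbol{x}_n, \qquad e_n = y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}_{n-1},$$

which is the normalized LMS, introduced in Section 5.6.1.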

• Closely related to the APSM algorithmic family are the set-membership algorithms (for example,
[29,32-34,61]). This family can be seen as a special case of the APSM philosophy, where only
special types of convex sets are used, for example, hyperslabs. Also, at each iteration step, a single
projection is performed onto the set associated with the most recent pair of observations. For
example, in [34,99] the update recursion of a set-membership APA is given by

$$\boldsymbol{\theta}_n = \begin{cases} \boldsymbol{\theta}_{n-1} + X_n (X_n^T X_n)^{-1} (\boldsymbol{e}_n - \boldsymbol{y}_n), & \text{if } |e_n| > \epsilon, \\ \boldsymbol{\theta}_{n-1}, & \text{otherwise}, \end{cases} \tag{8.44}$$

where $X_n = [\boldsymbol{x}_n, \boldsymbol{x}_{n-1}, \ldots, \boldsymbol{x}_{n-q+1}]$, $\boldsymbol{y}_n = [y_n, y_{n-1}, \ldots, y_{n-q+1}]^T$, and $\boldsymbol{e}_n = [e_n, e_{n-1}, \ldots,
e_{n-q+1}]^T$, with $e_n = y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}_{n-1}$. The stochastic analysis of the set-membership APA [34]
establishes its mean-square error (MSE) performance, and the analysis is carried out by adopting energy
conservation arguments (Chapter 5).
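The first remark above can be verified numerically. The sketch below uses hypothetical helper names and arbitrary numbers. It compares a relaxed projection onto the hyperplane $\boldsymbol{x}_n^T\boldsymbol{\theta} = y_n$ (the case $\epsilon = 0$, $q = 1$) with the NLMS update written in terms of the error $e_n$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project_hyperplane(theta, x, y):
    """Projection onto the hyperplane {t : x^T t = y}, cf. (8.43)."""
    c = (dot(x, theta) - y) / dot(x, x)
    return [t - c * xi for t, xi in zip(theta, x)]

def relaxed_projection_step(theta, x, y, mu):
    """theta + mu * (P_H(theta) - theta): the single-projection update."""
    p = project_hyperplane(theta, x, y)
    return [t + mu * (pi - t) for t, pi in zip(theta, p)]

def nlms_step(theta, x, y, mu):
    """NLMS update: theta + mu * e * x / ||x||^2, with e = y - x^T theta."""
    e = y - dot(x, theta)
    g = mu * e / dot(x, x)
    return [t + g * xi for t, xi in zip(theta, x)]

# Arbitrary illustrative values (assumptions).
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, -1.0]
y = 0.7
a = relaxed_projection_step(theta, x, y, 1.2)
b = nlms_step(theta, x, y, 1.2)
print(a, b)
```

The two updates agree up to rounding, and with `mu = 1.0` the new estimate lands exactly on the hyperplane.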


Example 8.2. The goal of this example is to demonstrate the comparative convergence performance of
the NLMS, APA, APSM, and recursive least-squares (RLS) algorithms. The experiments were performed
in two different noise settings, one for low and one for high noise levels, to demonstrate the sensitivity
of the APA compared to the APSM. Data were generated according to our familiar model

$$y_n = \boldsymbol{\theta}_o^T \boldsymbol{x}_n + \eta_n.$$

The parameters $\boldsymbol{\theta}_o \in \mathbb{R}^{200}$ were randomly chosen from $\mathcal{N}(0, 1)$ and then fixed. The input vectors were
formed by a white noise sequence with samples i.i.d. drawn from $\mathcal{N}(0, 1)$.

In the first experiment, the white noise sequence was chosen to have variance $\sigma^2 = 0.01$. The
parameters for the three algorithms were chosen as $\mu = 1.2$ and $\delta = 0.001$ for the NLMS, $q = 30$, $\mu = 0.2$,
and $\delta = 0.001$ for the APA, and $\epsilon = \sqrt{2}\sigma$, $q = 30$, and $\mu_n = 0.5 M_n$ for the APSM. These
parameters lead the algorithms to settle at the same error floor. Fig. 8.25 shows the obtained squared error,
averaged over 100 realizations, in dBs ($10\log_{10}(e_n^2)$). For comparison, the RLS convergence curve is
given for $\beta = 1$, which converges faster and at the same time settles at a lower error floor. If $\beta$ is
modified to a smaller value so that the RLS settles at the same error floor as the other algorithms, then its
convergence gets even faster. However, this improved performance of the RLS is achieved at higher
complexity, which becomes a problem for large values of $l$. Observe the faster convergence achieved
by the APA and APSM, compared to the NLMS.

FIGURE 8.25

Mean-square error in dBs as a function of iterations. The data reuse ($q = 30$), associated with the APA and APSM,
offers a significant improvement in convergence speed compared to the NLMS. The curves for the APA and APSM
almost coincide in this low-noise scenario.

For the high-level noise, the corresponding variance was increased to 0.3. The obtained MSE curves
are shown in Fig. 8.26. Observe that now, the APA shows an inferior performance compared to APSM
in spite of its higher complexity, due to the involved matrix inversion.

FIGURE 8.26

Mean-square error in dBs as a function of iterations for a high-noise scenario. Compared to Fig. 8.25, all curves
settle at higher noise levels. Moreover, note that now the APA settles at a higher error floor than the corresponding
APSM algorithm, for the same convergence rate.


8.8 CONSTRAINED LEARNING

Learning under a set of constraints is of significant importance in signal processing and machine
learning, in general. We have already discussed a number of such learning tasks. Beamforming, discussed
in Chapters 4 and 5, is a typical one. In Chapter 3, while introducing the concept of overfitting, we
discussed the notion of regularization, which is another form of constraining the norm of the unknown
parameter vector. In some other cases, we have available a priori information concerning the unknown
parameters; this extra information can be given in the form of a set of constraints.

For example, if one is interested in obtaining estimates of the pixels in an image, then the values
must be nonnegative. More recently, the unknown parameter vector may be known to be sparse; that
is, only a few of its components are nonzero. In this case, constraining the respective $\ell_1$ norm can
significantly improve the accuracy as well as the convergence speed of an iterative scheme toward the
solution. Schemes that explicitly take into consideration the underlying sparsity are known as sparsity
promoting algorithms. They will be considered in more detail in Chapter 10.

Algorithms that spring from the POCS theory are particularly suited to treat constraints in an
elegant, robust, and rather straightforward way. Note that the goal of each constraint is to define a region
in the solution space, where the required estimate is “forced” to lie. For the rest of this section, we will
assume that the required estimate must satisfy $M$ constraints, each one defining a convex set of points,
$C_m$, $m = 1, 2, \ldots, M$. Moreover,

$$\bigcap_{m=1}^{M} C_m \neq \emptyset,$$

which means that the constraints are consistent (there are also methods where this condition can be
relaxed). Then it can be shown that the mapping $T$, defined as

$$T := P_{C_M} \cdots P_{C_1},$$

is a strongly attracting nonexpansive mapping, (8.25) (for example, [6,7]). Note that the same holds
true if instead of the concatenation of the projection operators, one could convexly combine them.

In the presence of a set of constraints, the only difference in the APSM in Algorithm 8.2 is that the
update recursion (8.39) is now replaced by

$$\boldsymbol{\theta}_n = T\left(\boldsymbol{\theta}_{n-1} + \mu_n\left(\sum_{k=n-q+1}^{n} \omega_k P_{S_{\epsilon,k}}(\boldsymbol{\theta}_{n-1}) - \boldsymbol{\theta}_{n-1}\right)\right). \tag{8.45}$$

In other words, for $M$ constraints, $M$ extra projection operations have to be performed. The same
applies to (8.38) with the difference being in the summation term in the brackets.
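As an illustration of how the mapping $T$ enters the update, the sketch below composes two projections and applies them to a vector standing in for the unconstrained update, that is, the argument of $T$ in (8.45). The two constraint sets (the nonnegative orthant and a Euclidean ball) and all function names are assumptions chosen for the example.

```python
import math

def project_nonneg(theta):
    """Projection onto the nonnegative orthant {t : t_i >= 0}."""
    return [max(t, 0.0) for t in theta]

def project_ball(theta, radius):
    """Projection onto the closed Euclidean ball of the given radius."""
    n = math.sqrt(sum(t * t for t in theta))
    if n <= radius:
        return list(theta)
    return [radius * t / n for t in theta]

def compose(*projections):
    """T = P_{C_M} ... P_{C_1}: apply the projections in the listed order."""
    def T(theta):
        for P in projections:
            theta = P(theta)
        return theta
    return T

# C_1: nonnegative orthant, C_2: unit ball (both are illustrative choices).
T = compose(project_nonneg, lambda t: project_ball(t, 1.0))

unconstrained = [2.0, -1.0, 2.0]   # stand-in for the argument of T in (8.45)
constrained = T(unconstrained)
print(constrained)
```

For this particular pair of sets, the output satisfies both constraints after a single pass. In general, a single application of $T$ need not land in the intersection, and it is the repeated application within the iterations that drives the estimate toward it.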

Remarks 8.6. 

• The constrained form of the APSM has been successfully applied in the beamforming task, and 
in particular in treating nontrivial constraints, as required in the robust beamforming case [77,80, 
103,104]. The constrained APSM has also been efficiently used for sparsity-aware learning (for 
example, [47,83]; see also Chapter 10). A more detailed review of related techniques is presented 
in [89]. 


8.9 THE DISTRIBUTED APSM 


Distributed algorithms were discussed in Chapter 5. In Section 5.13.2, two versions of the diffusion
LMS were introduced, namely, the adapt-then-combine and combine-then-adapt schemes. Diffusion
versions of the APSM algorithm have also appeared in both configurations [20,22]. For the APSM
case, both schemes result in very similar performance.

Following the discussion in Section 5.13.2, let the most recently received data pair at node
$k = 1, 2, \ldots, K$ be $(y_k(n), \boldsymbol{x}_k(n)) \in \mathbb{R} \times \mathbb{R}^l$. For the regression task, a corresponding hyperslab is
constructed, that is,

$$S_n^{(k)} = \left\{\boldsymbol{\theta} : |y_k(n) - \boldsymbol{x}_k^T(n)\boldsymbol{\theta}| \le \epsilon\right\}.$$

The goal is the computation of a point that lies in the intersection of all these sets, for $n = 1, 2, \ldots$.
Following similar arguments as those employed for the diffusion LMS, the combine-then-adapt version
of the APSM, given in Algorithm 8.3, is obtained.

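Algorithm 8.3 (The combine-then-adapt diffusion APSM).

• Initialization

- For $k = 1, 2, \ldots, K$, Do

* $\boldsymbol{\theta}_k(0) = \boldsymbol{0} \in \mathbb{R}^l$; or any other value.

- End For

- Select $A$ : $A^T \boldsymbol{1} = \boldsymbol{1}$

- Select $q$; The number of property sets to be processed at each time instant.

• For $n = 1, 2, \ldots, q - 1$, Do; Initial period, that is, $n < q$.

- For $k = 1, 2, \ldots, K$, Do

* $\boldsymbol{\psi}_k(n-1) = \sum_{m \in \mathcal{N}_k} a_{mk} \boldsymbol{\theta}_m(n-1)$; $\mathcal{N}_k$ the neighborhood of node $k$.

- End For

- For $k = 1, 2, \ldots, K$, Do

* Choose $\omega_1, \ldots, \omega_n$ : $\sum_{j=1}^{n} \omega_j = 1$, $\omega_j > 0$.

* Select $\mu_k(n) \in (0, 2M_k(n))$.

* $\boldsymbol{\theta}_k(n) = \boldsymbol{\psi}_k(n-1) + \mu_k(n)\left(\sum_{j=1}^{n} \omega_j P_{S_j^{(k)}}(\boldsymbol{\psi}_k(n-1)) - \boldsymbol{\psi}_k(n-1)\right)$

- End For

• For $n = q, q + 1, \ldots$, Do

- For $k = 1, 2, \ldots, K$, Do

* $\boldsymbol{\psi}_k(n-1) = \sum_{m \in \mathcal{N}_k} a_{mk} \boldsymbol{\theta}_m(n-1)$

- End For

- For $k = 1, 2, \ldots, K$, Do

* Choose $\omega_n, \ldots, \omega_{n-q+1}$ : $\sum_{j=n-q+1}^{n} \omega_j = 1$, $\omega_j > 0$.

* Select $\mu_k(n) \in (0, 2M_k(n))$.

* $\boldsymbol{\theta}_k(n) = \boldsymbol{\psi}_k(n-1) + \mu_k(n)\left(\sum_{j=n-q+1}^{n} \omega_j P_{S_j^{(k)}}(\boldsymbol{\psi}_k(n-1)) - \boldsymbol{\psi}_k(n-1)\right)$

- End For

• End For

The bound $M_k(n)$ of the step-size interval is defined as

$$M_k(n) = \frac{\sum_{j=n-q+1}^{n} \omega_j \left\|P_{S_j^{(k)}}(\boldsymbol{\psi}_k(n-1)) - \boldsymbol{\psi}_k(n-1)\right\|^2}{\left\|\sum_{j=n-q+1}^{n} \omega_j P_{S_j^{(k)}}(\boldsymbol{\psi}_k(n-1)) - \boldsymbol{\psi}_k(n-1)\right\|^2},$$

and similarly for the initial period.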
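The combine and adapt stages of Algorithm 8.3 can be sketched in a few lines. What follows is an illustrative toy, not the book's code: a fully connected network with uniform combination weights is assumed (so that the columns of $A$ sum to one), the weights $\omega_j$ are uniform, $\mu_k(n)$ is set to `mu_scale * M_k(n)`, and every function and variable name is hypothetical.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project_hyperslab(theta, x, y, eps):
    """Projection onto {t : |y - x^T t| <= eps} (x nonzero)."""
    r = dot(x, theta) - y
    if abs(r) <= eps:
        return list(theta)
    shift = r - eps if r > eps else r + eps
    nx2 = dot(x, x)
    return [t - shift * xi / nx2 for t, xi in zip(theta, x)]

def combine(thetas, A):
    """psi_k = sum_m a_{mk} theta_m; column k of A holds node k's weights."""
    K, l = len(thetas), len(thetas[0])
    return [[sum(A[m][k] * thetas[m][i] for m in range(K)) for i in range(l)]
            for k in range(K)]

def adapt(psi, window, eps, mu_scale):
    """APSM adaptation at one node, starting from the combined estimate psi."""
    w = 1.0 / len(window)                      # uniform omega_j (an assumption)
    diffs = [[p - s for p, s in zip(project_hyperslab(psi, x, y, eps), psi)]
             for x, y in window]
    direction = [w * sum(d[i] for d in diffs) for i in range(len(psi))]
    den = dot(direction, direction)
    if den == 0.0:
        return list(psi)
    M = sum(w * dot(d, d) for d in diffs) / den
    return [s + mu_scale * M * di for s, di in zip(psi, direction)]

def cta_step(thetas, A, windows, eps, mu_scale=0.5):
    """Combine first, then adapt at every node."""
    psis = combine(thetas, A)
    return [adapt(psi, win, eps, mu_scale) for psi, win in zip(psis, windows)]

def msd_value(thetas, theta_o):
    return sum(sum((a - b) ** 2 for a, b in zip(th, theta_o)) for th in thetas) / len(thetas)

random.seed(1)
K, q, eps = 3, 3, 0.1
theta_o = [0.5, -1.0]
A = [[1.0 / K] * K for _ in range(K)]          # fully connected, uniform weights
thetas = [[0.0, 0.0] for _ in range(K)]
data = [[] for _ in range(K)]
msd = []
for n in range(100):
    for k in range(K):
        x = [random.gauss(0.0, 1.0) for _ in theta_o]
        data[k].append((x, dot(x, theta_o) + random.uniform(-0.05, 0.05)))
    msd.append(msd_value(thetas, theta_o))
    thetas = cta_step(thetas, A, [d[-q:] for d in data], eps)
msd.append(msd_value(thetas, theta_o))
print(msd[0], msd[-1])
```

Each node only needs the estimates of its neighbors for the combine stage; the adaptation then uses purely local data, which matches the setting where no observations are exchanged.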

Remarks 8.7. 

• An important theoretical property of the APSM-based diffusion algorithms is that they enjoy asymp- 
totic consensus. In other words, the nodes converge asymptotically to the same estimate. This 
asymptotic consensus is not in the mean, as is the case with the diffusion LMS. This is interest- 
ing, since no explicit consensus constraints are employed. 

• In [22], an extra projection step is used after the combination and prior to the adaptation step. The 
goal of this extra step is to “harmonize” the local information, which comprises the input/output 
observations, with the information coming from the neighborhood, that is, the estimates obtained 
from the neighboring nodes. This speeds up convergence, at the cost of only one extra projection. 

• A scenario in which some of the nodes are damaged and the associated observations are very noisy 
is also treated in [22]. To deal with such a scenario, instead of the hyperslab, the APSM algorithm is 
rephrased around the Huber loss function, developed in the context of robust statistics, to deal with 
cases where outliers are present (see also Chapter 11). 

Example 8.3. The goal of this example is to demonstrate the comparative performance of the
diffusion LMS and APSM. A network of $K = 10$ nodes is considered, and there are 32 connections
among the nodes. In each node, data are generated according to a regression model, using the same
vector $\boldsymbol{\theta}_o \in \mathbb{R}^{60}$. The latter was randomly generated via a normal $\mathcal{N}(0, 1)$. The input vectors were
i.i.d. generated according to the normal $\mathcal{N}(0, 1)$. The noise level at each node varied between 20 and
25 dBs. The parameters for the algorithms were chosen for optimized performance (after
experimentation) and for similar convergence rate. For the LMS, $\mu = 0.035$, and for the APSM, $\epsilon = \sqrt{2}\sigma$, $q = 20$,
$\mu_k(n) = 0.2 M_k(n)$. The combination weights were chosen according to the Metropolis rule and
the data combination matrix was the identity one (no observations are exchanged). Fig. 8.27 shows the
benefits of the data reuse offered by the APSM. The curves show the mean-square deviation
(MSD $= \frac{1}{K}\sum_{k=1}^{K} \|\boldsymbol{\theta}_k(n) - \boldsymbol{\theta}_o\|^2$) as a function of the number of iterations.

FIGURE 8.27

The MSD as a function of the number of iterations. The improved performance due to the data reuse offered by the
diffusion APSM is readily observed. Moreover, observe the significant performance improvement offered by all
cooperation schemes, compared to the noncooperative LMS; for the latter, only one node is used.


8.10 OPTIMIZING NONSMOOTH CONVEX COST FUNCTIONS

Estimating parameters via the use of convex loss functions in the presence of a set of constraints is
an established and well-researched field in optimization, with numerous applications in a wide range
of disciplines. The mainstream methods follow either the Lagrange multiplier philosophy [11,15] or
the rationale behind the so-called interior point methods [15,90]. In this section, we will focus on
an alternative path and consider iterative schemes, which can be considered as the generalization of
the gradient descent method, discussed in Chapter 5. The reason is that such techniques give rise to
variants that scale well with the dimensionality and have inspired a number of algorithms, which have
been suggested for online learning within the machine learning and signal processing communities.
Later on, we will move to more advanced techniques that build on the so-called operator/mapping and
fixed point theoretic framework.

Although the stage of our discussion will be that of Euclidean spaces, $\mathbb{R}^l$, everything that will be
said can be generalized to infinite-dimensional Hilbert spaces; we will consider such cases in
Chapter 11.

8.10.1 SUBGRADIENTS AND SUBDIFFERENTIALS 

We have already met the first-order convexity condition in (8.3) and it was shown that this is a sufficient
and necessary condition for convexity, provided, of course, that the gradient exists. The condition
basically states that the graph of the convex function lies above the hyperplanes, which are tangent at
any point $(\boldsymbol{x}, f(\boldsymbol{x}))$ that lies on this graph.

Let us now move a step forward and assume a function

$$f : \mathcal{X} \subseteq \mathbb{R}^l \longmapsto \mathbb{R}$$

to be convex and continuous, but nonsmooth. This means that there are points where the gradient is
not defined. Our goal now becomes that of generalizing the notion of gradient, for the case of convex
functions.

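Definition 8.7. A vector $\boldsymbol{g} \in \mathbb{R}^l$ is said to be a subgradient of a convex function $f$ at a point $\boldsymbol{x} \in \mathcal{X}$
if the following is true:

$$f(\boldsymbol{y}) \ge f(\boldsymbol{x}) + \boldsymbol{g}^T(\boldsymbol{y} - \boldsymbol{x}), \quad \forall \boldsymbol{y} \in \mathcal{X}: \quad \text{subgradient}. \tag{8.46}$$

It turns out that this vector is not unique. All the subgradients of a (convex) function at a point
comprise a set.

Definition 8.8. The subdifferential of a convex function $f$ at $\boldsymbol{x} \in \mathcal{X}$, denoted as $\partial f(\boldsymbol{x})$, is defined as
the set

$$\partial f(\boldsymbol{x}) := \left\{\boldsymbol{g} \in \mathbb{R}^l : f(\boldsymbol{y}) \ge f(\boldsymbol{x}) + \boldsymbol{g}^T(\boldsymbol{y} - \boldsymbol{x}), \ \forall \boldsymbol{y} \in \mathcal{X}\right\}: \quad \text{subdifferential}. \tag{8.47}$$

If $f$ is differentiable at a point $\boldsymbol{x}$, then $\partial f(\boldsymbol{x})$ becomes a singleton, that is,

$$\partial f(\boldsymbol{x}) = \{\nabla f(\boldsymbol{x})\}.$$

Note that if $f(\boldsymbol{x})$ is convex, then the set $\partial f(\boldsymbol{x})$ is nonempty and convex. Moreover, $f(\boldsymbol{x})$ is
differentiable at a point $\boldsymbol{x}$ if and only if it has a unique subgradient [11]. From now on, we will denote a
subgradient of $f$ at the point $\boldsymbol{x}$ as $f'(\boldsymbol{x})$.

Fig. 8.28 gives a geometric interpretation of the notion of the subgradient. Each one of the
subgradients at the point $\boldsymbol{x}_0$ defines a hyperplane that supports the graph of $f$. At $\boldsymbol{x}_0$, there is an infinity of
subgradients, which comprise the subdifferential (set) at $\boldsymbol{x}_0$. At $\boldsymbol{x}_1$, the function is differentiable and
there is a unique subgradient that coincides with the gradient at $\boldsymbol{x}_1$.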

FIGURE 8.28

At $\boldsymbol{x}_0$, there is an infinity of subgradients, each one defining a hyperplane in the extended $(\boldsymbol{x}, f(\boldsymbol{x}))$ space. All these
hyperplanes pass through the point $(\boldsymbol{x}_0, f(\boldsymbol{x}_0))$ and support the graph of $f(\cdot)$. At the point $\boldsymbol{x}_1$, there is a unique
subgradient that coincides with the gradient and defines the unique tangent hyperplane at the respective point of the
graph.

Example 8.4. Let $x \in \mathbb{R}$ and

$$f(x) = |x|.$$

Show that

$$\partial f(x) = \begin{cases} \mathrm{sgn}(x), & \text{if } x \neq 0, \\ g \in [-1, 1], & \text{if } x = 0, \end{cases}$$

where $\mathrm{sgn}(\cdot)$ is the sign function, being equal to 1 if its argument is positive and $-1$ if the argument is
negative.

Indeed, if $x > 0$, then

$$g = \frac{dx}{dx} = 1,$$

and similarly $g = -1$, if $x < 0$. For $x = 0$, any $g \in [-1, 1]$ satisfies

$$g(y - 0) + 0 = gy \le |y|,$$

and it is a subgradient. This is illustrated in Fig. 8.29.
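The defining inequality (8.46) can be probed numerically for this example. The sketch below is only a finite-grid check, not a proof, and its helper names are assumptions. It tests whether a candidate slope $g$ satisfies $|y| \ge |x| + g(y - x)$ at a set of sample points.

```python
def subdifferential_abs(x):
    """Subdifferential of f(x) = |x|, returned as a closed interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

def is_subgradient(g, x, test_points, tol=1e-12):
    """Check |y| >= |x| + g*(y - x) on a finite grid of test points y."""
    return all(abs(y) >= abs(x) + g * (y - x) - tol for y in test_points)

# Grid of test points in [-5, 5] (an arbitrary illustrative choice).
grid = [k / 10.0 for k in range(-50, 51)]

print(is_subgradient(0.3, 0.0, grid))    # slope inside [-1, 1] at the kink
print(is_subgradient(1.5, 0.0, grid))    # slope outside [-1, 1] at the kink
print(is_subgradient(1.0, 2.0, grid))    # the derivative at a smooth point
```

Any slope in $[-1, 1]$ passes the check at $x = 0$, whereas a slope such as 1.5 is rejected because the corresponding line cuts through the graph of $|x|$.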

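FIGURE 8.29

All lines with slope in $[-1, 1]$ comprise the subdifferential at $x = 0$.

Lemma 8.2. Given a convex function $f : \mathcal{X} \subseteq \mathbb{R}^l \longmapsto \mathbb{R}$, a point $\boldsymbol{x}_* \in \mathcal{X}$ is a minimizer of $f$ if and
only if the zero vector belongs to its subdifferential set, that is,

$$\boldsymbol{0} \in \partial f(\boldsymbol{x}_*): \quad \text{condition for a minimizer}. \tag{8.48}$$

Proof. The proof is straightforward from the definition of a subgradient. Indeed, assume that
$\boldsymbol{0} \in \partial f(\boldsymbol{x}_*)$. Then the following is valid:

$$f(\boldsymbol{y}) \ge f(\boldsymbol{x}_*) + \boldsymbol{0}^T(\boldsymbol{y} - \boldsymbol{x}_*), \quad \forall \boldsymbol{y} \in \mathcal{X},$$

and $\boldsymbol{x}_*$ is a minimizer. If now $\boldsymbol{x}_*$ is a minimizer, then we have

$$f(\boldsymbol{y}) \ge f(\boldsymbol{x}_*) = f(\boldsymbol{x}_*) + \boldsymbol{0}^T(\boldsymbol{y} - \boldsymbol{x}_*),$$

and hence $\boldsymbol{0} \in \partial f(\boldsymbol{x}_*)$. □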

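Example 8.5. Consider the metric distance function

$$d_C(\boldsymbol{x}) := \min_{\boldsymbol{y} \in C} \|\boldsymbol{x} - \boldsymbol{y}\|.$$

This is the Euclidean distance of a point from its projection on a closed convex set $C$, as defined in
Section 8.3. Then show (Problem 8.24) that the subdifferential is given by

$$\partial d_C(\boldsymbol{x}) = \begin{cases} \dfrac{\boldsymbol{x} - P_C(\boldsymbol{x})}{\|\boldsymbol{x} - P_C(\boldsymbol{x})\|}, & \boldsymbol{x} \notin C, \\[2mm] N_C(\boldsymbol{x}) \cap B[\boldsymbol{0}, 1], & \boldsymbol{x} \in C, \end{cases} \tag{8.49}$$

where

$$N_C(\boldsymbol{x}) := \left\{\boldsymbol{g} \in \mathbb{R}^l : \boldsymbol{g}^T(\boldsymbol{y} - \boldsymbol{x}) \le 0, \ \forall \boldsymbol{y} \in C\right\},$$

and

$$B[\boldsymbol{0}, 1] := \left\{\boldsymbol{x} \in \mathbb{R}^l : \|\boldsymbol{x}\| \le 1\right\}.$$

Moreover, if $\boldsymbol{x}$ is an interior point of $C$, then

$$\partial d_C(\boldsymbol{x}) = \{\boldsymbol{0}\}.$$

Observe that for all points $\boldsymbol{x} \notin C$ as well as for all interior points of $C$, the subdifferential is a singleton,
which means that $d_C(\boldsymbol{x})$ is differentiable there. Recall that the function $d_C(\cdot)$ is nonnegative, convex, and
continuous [44]. Note that (8.49) is also generalized to infinite-dimensional Hilbert spaces.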
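For a concrete instance of (8.49), the sketch below takes $C$ to be a closed Euclidean ball centered at the origin, which is an assumption made only because its projection has a simple closed form. For points in $C$ the function returns the zero vector, which always belongs to $N_C(\boldsymbol{x}) \cap B[\boldsymbol{0}, 1]$; the helper names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def project_ball(x, r):
    """Projection onto C = {x : ||x|| <= r}."""
    n = norm(x)
    if n <= r:
        return list(x)
    return [r * t / n for t in x]

def dist_ball(x, r):
    """d_C(x): distance of x from its projection onto the ball."""
    p = project_ball(x, r)
    return norm([a - b for a, b in zip(x, p)])

def subgrad_dist_ball(x, r):
    """One subgradient of d_C at x, following (8.49)."""
    p = project_ball(x, r)
    diff = [a - b for a, b in zip(x, p)]
    d = norm(diff)
    if d == 0.0:
        return [0.0] * len(x)      # x in C: the zero vector is always admissible
    return [t / d for t in diff]   # x outside C: unit vector pointing away from P_C(x)

x, r = [3.0, 4.0], 1.0
g = subgrad_dist_ball(x, r)
print(dist_ball(x, r), g)

# Spot-check the subgradient inequality d_C(y) >= d_C(x) + g^T (y - x).
for y in ([0.0, 0.0], [6.0, 8.0], [-3.0, 4.0], [0.5, 0.0]):
    rhs = dist_ball(x, r) + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))
    print(y, dist_ball(y, r) >= rhs - 1e-9)
```

Outside the ball the subgradient is the unit vector along $\boldsymbol{x} - P_C(\boldsymbol{x})$, so its norm is one regardless of how far $\boldsymbol{x}$ is from $C$.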

8.10.2 MINIMIZING NONSMOOTH CONTINUOUS CONVEX LOSS FUNCTIONS:
THE BATCH LEARNING CASE

Let $J$ be a cost function

$$J : \mathbb{R}^l \longmapsto [0, +\infty),$$

and let $C$ be a closed convex set, $C \subseteq \mathbb{R}^l$. Our task is to compute a minimizer with respect to an
unknown parameter vector, that is,

$$\boldsymbol{\theta}_* = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \quad \text{s.t.} \ \boldsymbol{\theta} \in C, \tag{8.50}$$

and we will assume that the set of solutions is nonempty; $J$ is assumed to be convex, continuous, but
not necessarily differentiable at all points. We have already seen examples of such loss functions, such
as the $\epsilon$-insensitive linear function in (8.33) and the hinge one (8.37). The $\ell_1$ norm function is another
example, and it will be treated in Chapters 9 and 10.


The Subgradient Method 

Our starting point is the simplest of the cases, where $C = \mathbb{R}^l$; that is, the minimizing task is
unconstrained. The first thought that comes to mind is to consider the generalization of the gradient descent
method, which was introduced in Chapter 5, and replace the gradient by the subgradient operation. The
resulting scheme is known as the subgradient algorithm [74,75].

Starting from an arbitrary estimate, $\boldsymbol{\theta}^{(0)} \in \mathbb{R}^l$, the update recursions become

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu_i J'(\boldsymbol{\theta}^{(i-1)}): \quad \text{subgradient algorithm}, \tag{8.51}$$

where $J'$ denotes any subgradient of the cost function, and $\mu_i$ is a step size sequence judiciously chosen
so that convergence is guaranteed. In spite of the similarity in the appearance with our familiar gradient
descent scheme, there are some major differences. The reader may have noticed that the new algorithm
was not called subgradient “descent.” This is because the update in (8.51) is not necessarily performed
in the descent direction. Thus, during the operation of the algorithm, the value of the cost function may
increase. Recall that in the gradient descent methods, the value of the cost function is guaranteed to
decrease with each iteration step, which also led to a linear convergence rate, as we have pointed out
in Chapter 5.

In contrast, for the subgradient method no such statements can be made. To establish
convergence, a different route has to be adopted. To this end, let us define

$$J_*^{(i)} := \min\left\{J(\boldsymbol{\theta}^{(i)}), J(\boldsymbol{\theta}^{(i-1)}), \ldots, J(\boldsymbol{\theta}^{(0)})\right\}, \tag{8.52}$$

which can also be recursively obtained by

$$J_*^{(i)} = \min\left\{J_*^{(i-1)}, J(\boldsymbol{\theta}^{(i)})\right\}.$$

$^{7}$ Recall that all the methods to be reported can be extended to general Hilbert spaces, $\mathbb{H}$.

Then the following holds true.

Proposition 8.3. Let $J$ be a convex cost function. Assume that the subgradients at all points are
bounded, that is,

$$\|J'(\boldsymbol{x})\| \le G, \quad \forall \boldsymbol{x} \in \mathbb{R}^l.$$

Let us also assume that the step size sequence is a diminishing one, such that

$$\sum_{i=1}^{\infty} \mu_i = \infty, \quad \sum_{i=1}^{\infty} \mu_i^2 < \infty.$$

Then

$$\lim_{i \to \infty} J_*^{(i)} = J(\boldsymbol{\theta}_*),$$

where $\boldsymbol{\theta}_*$ is a minimizer, assuming that the set of minimizers is not empty.


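Proof. We have

$$\begin{aligned} \|\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*\|^2 &= \|\boldsymbol{\theta}^{(i-1)} - \mu_i J'(\boldsymbol{\theta}^{(i-1)}) - \boldsymbol{\theta}_*\|^2 \\ &= \|\boldsymbol{\theta}^{(i-1)} - \boldsymbol{\theta}_*\|^2 - 2\mu_i J'^T(\boldsymbol{\theta}^{(i-1)})(\boldsymbol{\theta}^{(i-1)} - \boldsymbol{\theta}_*) \\ &\quad + \mu_i^2 \|J'(\boldsymbol{\theta}^{(i-1)})\|^2. \end{aligned} \tag{8.53}$$

By the definition of the subgradient, we have

$$J(\boldsymbol{\theta}_*) - J(\boldsymbol{\theta}^{(i-1)}) \ge J'^T(\boldsymbol{\theta}^{(i-1)})(\boldsymbol{\theta}_* - \boldsymbol{\theta}^{(i-1)}). \tag{8.54}$$

Plugging (8.54) in (8.53) and after some algebraic manipulations, by applying the resulting inequality
recursively (Problem 8.25), we finally obtain

$$J_*^{(i)} - J(\boldsymbol{\theta}_*) \le \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}_*\|^2}{2\sum_{k=1}^{i} \mu_k} + \frac{\sum_{k=1}^{i} \mu_k^2}{2\sum_{k=1}^{i} \mu_k}\, G^2. \tag{8.55}$$

Leaving $i$ to grow to infinity and taking into account the assumptions, the claim is proved. □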


There are a number of variants of this proof. Other choices for the diminishing sequence can also
guarantee convergence, such as $\mu_i = 1/\sqrt{i}$. Moreover, in certain cases, some of the assumptions may
be relaxed. Note that the assumption of the subgradient being bounded is guaranteed if $J$ is $\gamma$-Lipschitz
continuous (Problem 8.26), that is, there is $\gamma > 0$ such that

$$|J(\boldsymbol{y}) - J(\boldsymbol{x})| \le \gamma \|\boldsymbol{y} - \boldsymbol{x}\|, \quad \forall \boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^l.$$

Interpreting the proposition from a slightly different angle, we can say that the algorithm generates a
subsequence of estimates, $\boldsymbol{\theta}_{i_*}$, which corresponds to the values of $J_*^{(i)}$, that is, $J(\boldsymbol{\theta}_{i_*}) \le J(\boldsymbol{\theta}_i)$, $i \le i_*$,
which converges to $\boldsymbol{\theta}_*$. The best possible convergence rate that may be achieved is of the order of
$O(1/\sqrt{i})$, if one optimizes the bound in (8.55) with respect to $\mu_k$ [64], which can be obtained if $\mu_i = c/\sqrt{i}$,
where $c$ is a constant. In any case, it is readily noticed that the convergence speed of such methods is
rather slow. Yet due to their computational simplicity, they are still in use, especially in cases where
the number of data samples is large. The interested reader can obtain more on the subgradient method
from [10,75].
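The following minimal sketch runs the subgradient recursion (8.51) with the step size $\mu_i = c/\sqrt{i}$ on a toy nonsmooth cost and tracks the best value found so far, as in (8.52). The cost function, the starting point, and all names are illustrative assumptions.

```python
import math

def J(theta):
    """Toy nonsmooth convex cost (an assumption), minimized at (1, -3) with J = 0."""
    return abs(theta[0] - 1.0) + 2.0 * abs(theta[1] + 3.0)

def sign(t):
    return (t > 0) - (t < 0)

def J_sub(theta):
    """One valid subgradient of J; zero is chosen at the kinks."""
    return [float(sign(theta[0] - 1.0)), 2.0 * sign(theta[1] + 3.0)]

def subgradient_method(theta0, steps, c=1.0):
    theta = list(theta0)
    best = J(theta)
    history = [best]                      # history[i] holds the best value up to step i
    for i in range(1, steps + 1):
        mu = c / math.sqrt(i)             # diminishing step size
        g = J_sub(theta)
        theta = [t - mu * gi for t, gi in zip(theta, g)]
        best = min(best, J(theta))        # the cost itself may go up; the best value cannot
        history.append(best)
    return theta, history

theta, history = subgradient_method([5.0, 5.0], 2000)
print(theta, history[-1])
```

For this cost the subgradients are bounded by $G = \sqrt{5}$, so the bound (8.55) applies; evaluating it for the 2000 steps used here gives a value below 0.7, and the tracked best value respects it.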

Example 8.6 (The perceptron algorithm). Recall the hinge loss function with $\rho = 0$, defined in (8.37),

$$\mathcal{L}(y, \boldsymbol{\theta}^T \boldsymbol{x}) = \max(0, -y\boldsymbol{\theta}^T \boldsymbol{x}).$$

In a two-class classification task, we are given a set of training samples, $(y_n, \boldsymbol{x}_n) \in \{-1, +1\} \times
\mathbb{R}^{l+1}$, $n = 1, 2, \ldots, N$, and the goal is to compute a linear classifier to minimize the empirical loss
function

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \mathcal{L}(y_n, \boldsymbol{\theta}^T \boldsymbol{x}_n). \tag{8.56}$$

We will assume the classes to be linearly separable, which guarantees that there is a solution; that
is, there exists a hyperplane that classifies correctly all data points. Obviously, such a hyperplane will
score a zero for the cost function in (8.56). We have assumed that the dimension of our input data space
has been increased by one, to account for the bias term for hyperplanes not crossing the origin.

The subdifferential of the hinge loss function is easily checked (e.g., use geometric arguments
which relate a subgradient with a support hyperplane of the respective function graph) to be

$$\partial \mathcal{L}(y_n, \boldsymbol{\theta}^T \boldsymbol{x}_n) = \begin{cases} \boldsymbol{0}, & y_n \boldsymbol{\theta}^T \boldsymbol{x}_n > 0, \\ -y_n \boldsymbol{x}_n, & y_n \boldsymbol{\theta}^T \boldsymbol{x}_n < 0, \\ \boldsymbol{g} \in [-y_n \boldsymbol{x}_n, \boldsymbol{0}], & y_n \boldsymbol{\theta}^T \boldsymbol{x}_n = 0. \end{cases} \tag{8.57}$$

We choose to work with the following subgradient:

$$\mathcal{L}'(y_n, \boldsymbol{\theta}^T \boldsymbol{x}_n) = -y_n \boldsymbol{x}_n \chi_{(-\infty, 0]}(y_n \boldsymbol{\theta}^T \boldsymbol{x}_n), \tag{8.58}$$

where $\chi_A(\tau)$ is the characteristic function, defined as

$$\chi_A(\tau) := \begin{cases} 1, & \tau \in A, \\ 0, & \tau \notin A. \end{cases} \tag{8.59}$$

The subgradient algorithm now becomes

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \mu_i \sum_{n=1}^{N} y_n \boldsymbol{x}_n \chi_{(-\infty, 0]}(y_n \boldsymbol{\theta}^{(i-1)T} \boldsymbol{x}_n). \tag{8.60}$$

This is the celebrated perceptron algorithm, which we are going to see in more detail in Chapter 18.
Basically, what the algorithm in (8.60) says is the following. Starting from an arbitrary $\boldsymbol{\theta}^{(0)}$, test all
training vectors with $\boldsymbol{\theta}^{(i-1)}$. Select all those vectors that fail to predict the correct class (for the correct
class, $y_n \boldsymbol{\theta}^{(i-1)T} \boldsymbol{x}_n > 0$), and update the current estimate toward the direction of the weighted (by the
corresponding label) average of the misclassified patterns. It turns out that the algorithm converges in
a finite number of steps, even if the step size sequence is not a diminishing one. This is what was said
before, i.e., in certain cases, convergence of the subgradient algorithm is guaranteed even if some of
the assumptions in Proposition 8.3 do not hold.
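The recursion (8.60) translates almost line by line into code. The sketch below is a minimal illustration with a constant step size; the tiny linearly separable data set and the names `perceptron`, `mu`, and `max_iter` are assumptions. Each input vector has already been extended by a trailing 1 to absorb the bias term.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron(samples, mu=1.0, max_iter=1000):
    """Batch perceptron: samples is a list of (y, x) with y in {-1, +1}."""
    theta = [0.0] * len(samples[0][1])
    for it in range(max_iter):
        # The characteristic function selects samples with y * theta^T x <= 0.
        wrong = [(y, x) for y, x in samples if y * dot(theta, x) <= 0.0]
        if not wrong:
            return theta, it               # every sample is on the correct side
        # Sum of y_n x_n over the selected samples, all evaluated at the same theta.
        update = [sum(y * x[i] for y, x in wrong) for i in range(len(theta))]
        theta = [t + mu * u for t, u in zip(theta, update)]
    return theta, max_iter

# Toy separable data (assumption); the last coordinate is the constant 1.
samples = [
    (+1, [2.0, 2.0, 1.0]), (+1, [3.0, 1.0, 1.0]), (+1, [2.0, 3.0, 1.0]),
    (-1, [-1.0, -1.0, 1.0]), (-1, [-2.0, 0.0, 1.0]), (-1, [0.0, -2.0, 1.0]),
]
theta, iterations = perceptron(samples)
print(theta, iterations)
```

The loop stops as soon as no sample is selected by the characteristic function, which is the finite-step termination mentioned above for separable classes.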

The Generic Projected Subgradient Scheme 

The generic scheme from which a number of variants draw their origin is summarized as follows. Select
$\boldsymbol{\theta}^{(0)} \in \mathbb{R}^l$ arbitrarily. Then the iterative scheme

$$\boldsymbol{\theta}^{(i)} = P_C\left(\boldsymbol{\theta}^{(i-1)} - \mu_i J'(\boldsymbol{\theta}^{(i-1)})\right): \quad \text{GPS scheme}, \tag{8.61}$$

where $J'$ denotes a respective subgradient and $P_C$ is the projection operator onto $C$, converges
(converges weakly in the more general case) to a solution of the constrained task in (8.50). The sequence of
nonnegative real numbers, $\mu_i$, is judiciously selected. It is readily seen that this scheme is a
generalization of the gradient descent scheme, discussed in Chapter 5, if we set $C = \mathbb{R}^l$ and $J$ is differentiable.


The Projected Gradient Method (PGM) 

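This method is a special case of (8.61) if $J$ is smooth and we set $\mu_i = \mu$. That is,

$$\boldsymbol{\theta}^{(i)} = P_C\left(\boldsymbol{\theta}^{(i-1)} - \mu \nabla J(\boldsymbol{\theta}^{(i-1)})\right): \quad \text{PGM scheme}. \tag{8.62}$$

It turns out that if the gradient is $\gamma$-Lipschitz, that is,

$$\|\nabla J(\boldsymbol{\theta}) - \nabla J(\boldsymbol{h})\| \le \gamma \|\boldsymbol{\theta} - \boldsymbol{h}\|, \quad \gamma > 0, \ \forall \boldsymbol{\theta}, \boldsymbol{h} \in \mathbb{R}^l,$$

and

$$\mu \in \left(0, \frac{2}{\gamma}\right),$$

then starting from an arbitrary point $\boldsymbol{\theta}^{(0)}$, the sequence in (8.62) converges (weakly in a general Hilbert
space) to a solution of (8.50) [41,52].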

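Example 8.7 (Projected Landweber method). Let our optimization task be

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|\boldsymbol{y} - X\boldsymbol{\theta}\|^2, \\ \text{subject to} \quad & \boldsymbol{\theta} \in C, \end{aligned}$$

where $X \in \mathbb{R}^{m \times l}$, $\boldsymbol{y} \in \mathbb{R}^m$. Expanding and taking the gradient, we get

$$J(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^T X^T X \boldsymbol{\theta} - \boldsymbol{y}^T X \boldsymbol{\theta} + \frac{1}{2}\boldsymbol{y}^T \boldsymbol{y},$$

$$\nabla J(\boldsymbol{\theta}) = X^T X \boldsymbol{\theta} - X^T \boldsymbol{y}.$$

First we check that $\nabla J(\boldsymbol{\theta})$ is $\gamma$-Lipschitz. To this end, we have

$$\|X^T X(\boldsymbol{\theta} - \boldsymbol{h})\| \le \|X^T X\| \|\boldsymbol{\theta} - \boldsymbol{h}\| \le \lambda_{\max} \|\boldsymbol{\theta} - \boldsymbol{h}\|,$$

where the spectral norm of a matrix has been used (Section 6.4) and $\lambda_{\max}$ denotes the maximum
eigenvalue of $X^T X$. Thus, if

$$\mu \in \left(0, \frac{2}{\lambda_{\max}}\right),$$

the corresponding iterations in (8.62) converge to a solution of (8.50). The scheme has been used in
the context of compressed sensing where (as we will see in Chapter 10) the task of interest is

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|\boldsymbol{y} - X\boldsymbol{\theta}\|^2, \\ \text{subject to} \quad & \|\boldsymbol{\theta}\|_1 \le \rho. \end{aligned}$$

Then it turns out that projecting on the $\ell_1$ ball (corresponding to $C$) is equivalent to a soft thresholding
operation$^{8}$ [35]. A variant occurs if a projection on a weighted $\ell_1$ ball is used to speed up
convergence (Chapter 10). Projection on a weighted $\ell_1$ ball has been developed in [47], via fully geometric
arguments, and it also results in soft thresholding operations.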
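A minimal sketch of the projected Landweber iteration follows. To keep it short, the constraint set is taken to be the nonnegative orthant rather than the $\ell_1$ ball, and the step size is set to $\mu = 1/\|X\|_F^2$, which lies inside $(0, 2/\lambda_{\max})$ because the squared Frobenius norm equals the trace of $X^T X$ and therefore upper-bounds $\lambda_{\max}$. These choices, the data, and the function names are assumptions for illustration only.

```python
def matvec(X, v):
    """Compute X v for X stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def rmatvec(X, r):
    """Compute X^T r."""
    return [sum(X[i][j] * r[i] for i in range(len(X))) for j in range(len(X[0]))]

def project_nonneg(v):
    """Projection onto the nonnegative orthant (the assumed constraint set C)."""
    return [max(t, 0.0) for t in v]

def projected_landweber(X, y, project, iters=500):
    fro2 = sum(a * a for row in X for a in row)   # ||X||_F^2 >= lambda_max(X^T X)
    mu = 1.0 / fro2                               # safely inside (0, 2 / lambda_max)
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        residual = [p - t for p, t in zip(matvec(X, theta), y)]   # X theta - y
        grad = rmatvec(X, residual)                               # X^T X theta - X^T y
        theta = project([t - mu * g for t, g in zip(theta, grad)])
    return theta

# Small illustrative problem (assumption).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.5]
theta = projected_landweber(X, y, project_nonneg)
print(theta)
```

For this data the unconstrained least-squares solution has a negative second component, so the constraint is active and the iterations settle at the point where that component is clipped to zero.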

Projected Subgradient Method 

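Starting from an arbitrary point $\boldsymbol{\theta}^{(0)}$, for the recursion [2,55]

$$\boldsymbol{\theta}^{(i)} = P_C\left(\boldsymbol{\theta}^{(i-1)} - \frac{\mu_i}{\max\left\{1, \|J'(\boldsymbol{\theta}^{(i-1)})\|\right\}}\, J'(\boldsymbol{\theta}^{(i-1)})\right): \quad \text{PSMa}, \tag{8.63}$$

• either a solution of (8.50) is achieved in a finite number of steps,

• or the iterations converge (weakly in the general case) to a point in the set of solutions of (8.50),
provided that

$$\sum_{i=1}^{\infty} \mu_i = \infty, \quad \sum_{i=1}^{\infty} \mu_i^2 < \infty.$$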

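Another version of the projected subgradient algorithm was presented in [70]. Let $J_* = \min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ be
the minimum (strictly speaking the infimum) of a cost function, whose set of minimizers is assumed to
be nonempty. Then the iterative algorithm

$$\boldsymbol{\theta}^{(i)} = \begin{cases} P_C\left(\boldsymbol{\theta}^{(i-1)} - \mu_i \dfrac{J(\boldsymbol{\theta}^{(i-1)}) - J_*}{\|J'(\boldsymbol{\theta}^{(i-1)})\|^2}\, J'(\boldsymbol{\theta}^{(i-1)})\right), & \text{if } J'(\boldsymbol{\theta}^{(i-1)}) \neq \boldsymbol{0}, \\[2mm] P_C(\boldsymbol{\theta}^{(i-1)}), & \text{if } J'(\boldsymbol{\theta}^{(i-1)}) = \boldsymbol{0}, \end{cases} \quad \text{PSMb} \tag{8.64}$$

converges (weakly in infinite-dimensional spaces) for $\mu_i \in (0, 2)$, under some general conditions, and
assuming that the subgradient is bounded. The proof is a bit technical and the interested reader can
obtain it from, for example, [70,82].

$^{8}$ See Chapter 10 and Example 8.10.

Needless to say that, besides the previously reported major schemes discussed, there is a number of
variants; for a related review see [82].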
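The first of the projected subgradient recursions, the one that normalizes the step by $\max\{1, \|J'\|\}$, can be sketched as follows. The toy cost, the box constraint, the choice $\mu_i = 1/i$ (which is nonsummable but square-summable), and all names are assumptions made for the illustration.

```python
import math

def project_box(theta, lo, hi):
    """Projection onto the box C = [lo, hi]^l (the assumed constraint set)."""
    return [min(max(t, lo), hi) for t in theta]

def sign(t):
    return (t > 0) - (t < 0)

def J_sub(theta):
    """A subgradient of the toy cost J(theta) = |theta_1 - 2| + |theta_2 + 1|."""
    return [float(sign(theta[0] - 2.0)), float(sign(theta[1] + 1.0))]

def psm_a(theta0, steps):
    theta = list(theta0)
    for i in range(1, steps + 1):
        mu = 1.0 / i                                   # sum mu_i = inf, sum mu_i^2 < inf
        g = J_sub(theta)
        scale = mu / max(1.0, math.sqrt(sum(gi * gi for gi in g)))
        theta = project_box([t - scale * gi for t, gi in zip(theta, g)], 0.0, 1.0)
    return theta

# The unconstrained minimizer (2, -1) lies outside the box [0, 1]^2,
# so the constrained solution sits at the corner (1, 0).
theta = psm_a([0.5, 0.5], 50)
print(theta)
```

In this toy case the first alternative of the convergence statement occurs: the solution is reached after a finite number of steps, and the projection keeps the estimate there afterward.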


8.10.3 ONLINE LEARNING FOR CONVEX OPTIMIZATION

Online learning in the framework of the squared error loss function has been the focus in Chapters 5
and 6. One of the reasons that online learning was introduced was to give the potential to the algorithm
to track time variations in the underlying statistics. Another reason was to cope with the unknown
statistics when the cost function involved expectations, in the context of the stochastic approximation
theory. Moreover, online algorithms are of particular interest when the number of the available training
data and the dimensionality of the input space become very large, compared to the load that today’s
storage, processing, and networking devices can cope with. Exchanging information has now become
cheap and databases have been populated with a massive number of data. This has rendered batch
processing techniques for learning tasks with huge data sets impractical. Online algorithms that process
one data point at a time have now become an indispensable algorithmic tool.

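Recall from Section 3.14 that the ultimate goal of a machine learning task, given a loss function $\mathcal{L}$,
is to minimize the expected loss/risk, which in the context of parametric modeling can be written as

$$J(\boldsymbol{\theta}) = \mathbb{E}\left[\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\mathbf{x})\right)\right] := \mathbb{E}\left[\mathcal{L}(\boldsymbol{\theta}, y, \mathbf{x})\right]. \tag{8.65}$$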


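Instead, the corresponding empirical risk function is minimized, given a set of $N$ training points,

$$J_N(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(\boldsymbol{\theta}, y_n, \mathbf{x}_n). \tag{8.66}$$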


In this context, the subgradient scheme would take the form

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \frac{\mu_i}{N} \sum_{n=1}^{N} \mathcal{L}'_n\left(\boldsymbol{\theta}^{(i-1)}\right), \tag{8.67}$$

where for notational simplicity we used

$$\mathcal{L}_n(\boldsymbol{\theta}) := \mathcal{L}(\boldsymbol{\theta}, y_n, \mathbf{x}_n).$$

Thus, at each iteration, one has to compute $N$ subgradient values, which for large values of $N$ is
computationally cumbersome. One way out is to adopt stochastic approximation arguments, as explained
in Chapter 5, and come up with a corresponding online version,

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mu_n \mathcal{L}'_n(\boldsymbol{\theta}_{n-1}), \tag{8.68}$$

where now the iteration index, $i$, coincides with the time index, $n$. There are two different ways to view
(8.68). Either $n$ takes values in the interval $[1, N]$ and one cycles periodically until convergence, or $n$
is left to grow unbounded. The latter is very natural for very large values of $N$, and we focus on this
scenario from now on. Moreover, such a strategy can cope with slow time variations, if this is the case.
Note that in the online formulation, at each time instant a different loss function is involved and our task
becomes that of an asymptotic minimization. Furthermore, one has to study the asymptotic convergence
properties, as well as the respective convergence conditions. Soon, we are going to introduce a relatively
recent tool for analyzing the performance of online algorithms, namely, regret analysis.

It turns out that for each one of the optimization schemes discussed in Section 8.10.2, we can write
its online version. Given the sequence of loss functions $\mathcal{L}_n$, $n = 1, 2, \ldots$, the online version of the
generic projected subgradient scheme becomes

$$\boldsymbol{\theta}_n = P_C\left(\boldsymbol{\theta}_{n-1} - \mu_n \mathcal{L}'_n(\boldsymbol{\theta}_{n-1})\right), \quad n = 1, 2, 3, \ldots \tag{8.69}$$

In a more general setting, the constraint-related convex sets can be left to be time-varying too; in
other words, we can write $C_n$. For example, such schemes with time-varying constraints have been
developed in the context of sparsity-aware learning, where in place of the $\ell_1$ ball, a weighted $\ell_1$ ball
is used [47]. This has a really drastic effect in speeding up the convergence of the algorithm (see
Chapter 10).

Another example is the so-called adaptive gradient (AdaGrad) algorithm [38]. The projection
operator is defined in a more general context, in terms of the Mahalanobis distance, that is,

$$P_C^{G}(\mathbf{x}) = \arg\min_{\mathbf{z} \in C} (\mathbf{x} - \mathbf{z})^T G (\mathbf{x} - \mathbf{z}), \quad \forall \mathbf{x} \in \mathbb{R}^l. \tag{8.70}$$

In place of $G$, the square root of the sum of the outer products of the computed subgradients is used,
that is,

$$G_n = \sum_{k=1}^{n} \mathbf{g}_k \mathbf{g}_k^T,$$

where $\mathbf{g}_k = \mathcal{L}'_k(\boldsymbol{\theta}_{k-1})$ denotes the subgradient at time instant $k$. Also, the same matrix is used to weigh
the gradient correction and the scheme has the form

$$\boldsymbol{\theta}_n = P_C^{G_n^{1/2}}\left(\boldsymbol{\theta}_{n-1} - \mu_n G_n^{-1/2} \mathbf{g}_n\right). \tag{8.71}$$

The use of the (time-varying) weighting matrix accounts for the geometry of the data observed in earlier
iterations, which leads to a more informative gradient-based learning. For the sake of computational
savings, the structure of $G_n$ is taken to be diagonal. Different algorithmic settings are discussed in [38],
alongside the study of the convergence properties of the algorithm; see also Section 18.4.2.
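The diagonal variant can be illustrated with a minimal Python sketch. It assumes the unconstrained case, $C = \mathbb{R}^l$, so the projection in (8.71) reduces to the identity. The function name `adagrad_step` and the small constant `eps` are hypothetical choices made here for illustration and are not taken from [38].

```python
import math

def adagrad_step(theta, g, G_diag, mu=0.1, eps=1e-12):
    """One diagonal AdaGrad-style step without projection (C = R^l).

    theta  : current estimate (list of floats)
    g      : subgradient evaluated at theta (list of floats)
    G_diag : running sums of squared subgradient components, updated in place
    eps    : hypothetical safeguard against division by zero
    """
    new_theta = []
    for i, (t, gi) in enumerate(zip(theta, g)):
        G_diag[i] += gi * gi  # i-th diagonal entry of sum_k g_k g_k^T
        new_theta.append(t - mu * gi / (math.sqrt(G_diag[i]) + eps))
    return new_theta

theta = [0.0, 0.0]
G_diag = [0.0, 0.0]
# The two components differ in scale by a factor of 400.
theta = adagrad_step(theta, [4.0, -0.01], G_diag, mu=0.1)
```

After the first step both coordinates move by (almost exactly) `mu` in magnitude, regardless of the scale of the respective subgradient component. This is the per-coordinate normalization that the weighting matrix provides.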
Example 8.8 (The LMS algorithm). Let us assume that

$$\mathcal{L}_n(\boldsymbol{\theta}) = \frac{1}{2}\left(y_n - \boldsymbol{\theta}^T \mathbf{x}_n\right)^2,$$

and also set $C = \mathbb{R}^l$ and $\mu_n = \mu$. Then (8.69) becomes our familiar LMS recursion,

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \mu \left(y_n - \boldsymbol{\theta}_{n-1}^T \mathbf{x}_n\right) \mathbf{x}_n,$$

whose convergence properties have been discussed in Chapter 5.
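The recursion of Example 8.8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are noise-free and generated by a hypothetical parameter vector `theta_true`, and the step size `mu=0.1` is an arbitrary small value, not a recommendation.

```python
def lms(samples, mu, dim):
    """Online subgradient scheme (8.69) with C = R^l and squared error loss."""
    theta = [0.0] * dim
    for y, x in samples:
        # Prediction error that uses the previous estimate theta_{n-1}.
        err = y - sum(t * xi for t, xi in zip(theta, x))
        # theta_n = theta_{n-1} + mu * err * x_n
        theta = [t + mu * err * xi for t, xi in zip(theta, x)]
    return theta

# Noise-free toy data (hypothetical values chosen for illustration).
theta_true = [1.0, -2.0]
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)] * 200
samples = [(sum(t * xi for t, xi in zip(theta_true, x)), x) for x in inputs]
theta_hat = lms(samples, mu=0.1, dim=2)
```

With noise-free data the estimate approaches `theta_true`; with noisy data and a constant step size the estimate would instead fluctuate around the solution, as discussed in Chapter 5.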


The PEGASOS Algorithm 

The primal estimated subgradient solver for SVM (PEGASOS) algorithm is an online scheme built
around the hinge loss function regularized by the squared Euclidean norm of the parameter vector
[73]. From this point of view, it is an instance of the online version of the projected subgradient
algorithm. This algorithm results if in (8.69) we set

$$\mathcal{L}_n(\boldsymbol{\theta}) = \max\left(0, 1 - y_n \boldsymbol{\theta}^T \mathbf{x}_n\right) + \frac{\lambda}{2}\|\boldsymbol{\theta}\|^2, \tag{8.72}$$

where in this case, $\rho$ in the hinge loss function has been set equal to one. The associated empirical cost
function is

$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \max\left(0, 1 - y_n \boldsymbol{\theta}^T \mathbf{x}_n\right) + \frac{\lambda}{2}\|\boldsymbol{\theta}\|^2, \tag{8.73}$$

whose minimization results in the celebrated support vector machine. Note that the only differences
with the perceptron algorithm are the presence of the regularizer and the nonzero value of $\rho$. These
seemingly minor differences have important implications in practice, and we are going to say more on
this in Chapter 11, where nonlinear extensions treated in the more general context of Hilbert spaces
will be considered.

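The subgradient adopted by the PEGASOS is

$$\mathcal{L}'_n(\boldsymbol{\theta}) = \lambda \boldsymbol{\theta} - y_n \mathbf{x}_n \chi_{(-\infty, 0]}\left(y_n \boldsymbol{\theta}^T \mathbf{x}_n - 1\right). \tag{8.74}$$

The step size is chosen as $\mu_n = \frac{1}{\lambda n}$. Furthermore, in its more general formulation, at each iteration step,
an (optional) projection on the $\ell_2$ ball of radius $\frac{1}{\sqrt{\lambda}}$, $B\left[\mathbf{0}, \frac{1}{\sqrt{\lambda}}\right]$, is performed. The update recursion then
becomes

$$\boldsymbol{\theta}_n = P_{B\left[\mathbf{0}, \frac{1}{\sqrt{\lambda}}\right]}\left((1 - \mu_n \lambda)\boldsymbol{\theta}_{n-1} + \mu_n y_n \mathbf{x}_n \chi_{(-\infty, 0]}\left(y_n \boldsymbol{\theta}_{n-1}^T \mathbf{x}_n - 1\right)\right), \tag{8.75}$$

where $P_{B\left[\mathbf{0}, \frac{1}{\sqrt{\lambda}}\right]}$ is the projection on the respective $\ell_2$ ball given in (8.14). In (8.75), note that the
effect of the regularization is to smooth out the contribution of $\boldsymbol{\theta}_{n-1}$. A variant of the algorithm for
a fixed number of points, $N$, suggests averaging a number of $m$ subgradient values in an index
set, $A_n \subseteq \{1, 2, \ldots, N\}$, such that $k \in A_n$ if $y_k \boldsymbol{\theta}_{n-1}^T \mathbf{x}_k < 1$. Different scenarios for the choice of the
$m$ indices can be employed, with the random one being a possibility. The scheme is summarized in
Algorithm 8.4.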
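Algorithm 8.4 (The PEGASOS algorithm).

• Initialization

- Select $\boldsymbol{\theta}^{(0)}$; usually set to zero.

- Select $\lambda$.

- Select $m$; number of subgradient values to be averaged.

• For $n = 1, 2, \ldots, N$, Do

- Select $A_n \subseteq \{1, 2, \ldots, N\}$: (cardinality) $|A_n| = m$, uniformly at random.

- $\mu_n = \frac{1}{\lambda n}$

- $\boldsymbol{\theta}_n = (1 - \mu_n \lambda)\boldsymbol{\theta}_{n-1} + \frac{\mu_n}{m} \sum_{k \in A_n} y_k \mathbf{x}_k$

- $\boldsymbol{\theta}_n = \min\left\{1, \frac{1}{\sqrt{\lambda}\,\|\boldsymbol{\theta}_n\|}\right\} \boldsymbol{\theta}_n$; Optional.

• End For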

Application of regret analysis arguments points out that the required number of iterations for obtaining
a solution of accuracy $\epsilon$ is $O(1/\epsilon)$, when each iteration operates on a single training sample. The
algorithm is very similar to the algorithms proposed in [45,112]. The difference lies in the choice
of the step size. We will come back to these algorithms in Chapter 11. There, we are going to see that
online learning in infinite-dimensional spaces is trickier. In [73], a number of comparative tests
against well-established support vector machine algorithms have been performed using standard data
sets. The main advantage of the algorithm is its computational simplicity, and it achieves comparable
performance rates at lower computational costs.
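The single-sample recursion (8.75) can be sketched as follows. This is a minimal illustration, not the reference implementation of [73]: the function name `pegasos`, the uniform random sampling with a fixed `seed`, and the toy data set are all assumptions introduced here.

```python
import math
import random

def pegasos(data, lam, n_iter, project=True, seed=0):
    """Single-sample PEGASOS recursion (8.75) with step size mu_n = 1/(lam*n)."""
    rng = random.Random(seed)          # assumed: samples drawn uniformly at random
    dim = len(data[0][0])
    theta = [0.0] * dim
    radius = 1.0 / math.sqrt(lam)      # radius of the l2 ball B[0, 1/sqrt(lam)]
    for n in range(1, n_iter + 1):
        x, y = data[rng.randrange(len(data))]
        mu = 1.0 / (lam * n)
        # Margin computed with the previous estimate theta_{n-1}.
        margin = y * sum(t * xi for t, xi in zip(theta, x))
        # Regularizer part of the update: shrink the previous estimate.
        theta = [(1.0 - mu * lam) * t for t in theta]
        # Hinge part: active when chi_(-inf,0](margin - 1) equals one.
        if margin <= 1.0:
            theta = [t + mu * y * xi for t, xi in zip(theta, x)]
        # Optional projection onto the l2 ball.
        if project:
            norm = math.sqrt(sum(t * t for t in theta))
            if norm > radius:
                theta = [t * radius / norm for t in theta]
    return theta

# Hypothetical toy data: (feature vector, label in {+1, -1}).
data = [((2.0, 0.5), 1), ((1.5, -0.5), 1), ((-2.0, 0.3), -1), ((-1.0, -0.4), -1)]
lam = 0.1
theta = pegasos(data, lam=lam, n_iter=500)
theta_norm = math.sqrt(sum(t * t for t in theta))
```

With the projection enabled, the returned estimate always lies inside the ball of radius $1/\sqrt{\lambda}$. The mini-batch variant of Algorithm 8.4 would replace the single hinge term by an average over the selected index set.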


8.11 REGRET ANALYSIS 

A major effort when dealing with iterative learning algorithms is dedicated to the issue of convergence:
where the algorithm converges, under which conditions it converges, and how fast it converges to its
steady state. A large part of Chapter 5 was focused on the convergence properties of the LMS.
Furthermore, in the current chapter, when we discussed the various subgradient-based algorithms, convergence
properties were also reported.

In general, analyzing the convergence properties of online algorithms tends to be quite a formidable
task and classical approaches have to adopt a number of assumptions, sometimes rather strong. Typical
assumptions refer to the statistical nature of the data (e.g., being i.i.d. or the noise being white). In
addition, it may be assumed that the true model, which generates the data, is known, and/or that the
algorithm has reached a region in the parameter space that is close to a minimizer.

More recently, an alternative methodology has been developed which bypasses the need for such
assumptions. The methodology revolves around the concept of cumulative loss, which has already been
introduced in Chapter 5, Section 5.5.2. The method is known as regret analysis, and its birth is due to
developments in the interplay between game and learning theories (see, for example, [21]).

Let us assume that the training samples $(y_n, \mathbf{x}_n)$, $n = 1, 2, \ldots$, arrive sequentially and that an
adopted online algorithm makes the corresponding predictions, $\hat{y}_n$. The quality of the prediction for
each time instant is tested against a loss function, $\mathcal{L}(y_n, \hat{y}_n)$. The cumulative loss up to time $N$ is given
by

$$\mathcal{L}_{\text{cum}}(N) := \sum_{n=1}^{N} \mathcal{L}(y_n, \hat{y}_n). \tag{8.76}$$

Let $f$ be a fixed predictor. Then the regret of the online algorithm relative to $f$, when running up to
time instant $N$, is defined as

$$\text{Regret}_N(f) := \sum_{n=1}^{N} \mathcal{L}(y_n, \hat{y}_n) - \sum_{n=1}^{N} \mathcal{L}\left(y_n, f(\mathbf{x}_n)\right): \quad \text{regret relative to } f. \tag{8.77}$$


The name regret is inherited from game theory and it means how "sorry" the algorithm or the
learner (in the machine learning jargon) is, in retrospect, not to have followed the prediction of the
fixed predictor, $f$. The predictor $f$ is also known as the hypothesis. Also, if $f$ is chosen from a set of
functions, $\mathcal{F}$, this set is called the hypothesis class.

The regret relative to the family of functions $\mathcal{F}$, when the algorithm runs over $N$ time instants, is
defined as

$$\text{Regret}_N(\mathcal{F}) := \max_{f \in \mathcal{F}} \text{Regret}_N(f). \tag{8.78}$$

In the context of regret analysis, the goal becomes that of designing an online learning rule so that
the resulting regret with respect to an optimal fixed predictor is small; that is, the regret associated with
the learner should grow sublinearly (slower than linearly) with the number of iterations, $N$. Sublinear
growth guarantees that the difference between the average loss suffered by the learner and the average
loss of the optimal predictor will tend to zero asymptotically.

For the linear class of functions, we have

$$\hat{y}_n = \boldsymbol{\theta}_{n-1}^T \mathbf{x}_n,$$

and the loss can be written as

$$\mathcal{L}(y_n, \hat{y}_n) = \mathcal{L}\left(y_n, \boldsymbol{\theta}_{n-1}^T \mathbf{x}_n\right) := \mathcal{L}_n(\boldsymbol{\theta}_{n-1}).$$

Adapting (8.77) to the previous notation, we can write

$$\text{Regret}_N(\mathbf{h}) = \sum_{n=1}^{N} \mathcal{L}_n(\boldsymbol{\theta}_{n-1}) - \sum_{n=1}^{N} \mathcal{L}_n(\mathbf{h}), \tag{8.79}$$

where $\mathbf{h} \in C \subseteq \mathbb{R}^l$ is a fixed parameter vector in the set $C$ where solutions are sought.

Before proceeding, it is interesting to note that the cumulative loss is based on the loss suffered
by the learner, against $y_n$, $\mathbf{x}_n$, using the estimate $\boldsymbol{\theta}_{n-1}$, which has been trained on data up to and
including time instant $n - 1$. The pair $(y_n, \mathbf{x}_n)$ is not involved in its training. From this point of view,
the cumulative loss is in line with our desire to guard against overfitting.

In the framework of regret analysis, the path to follow is to derive an upper bound for the regret, 
exploiting the convexity of the employed loss function. We will demonstrate the technique via a case 
study, i.e., that of the online version of the simple subgradient algorithm. 
REGRET ANALYSIS OF THE SUBGRADIENT ALGORITHM

The online subgradient scheme of (8.68), for minimizing the expected loss, $\mathbb{E}\left[\mathcal{L}(\boldsymbol{\theta}, y, \mathbf{x})\right]$, is written as

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mu_n \mathbf{g}_n, \tag{8.80}$$

where for notational convenience the subgradient is denoted as

$$\mathbf{g}_n := \mathcal{L}'_n(\boldsymbol{\theta}_{n-1}).$$

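Proposition 8.4. Assume that the subgradients of the loss function are bounded, shown as

$$\|\mathbf{g}_n\| \le G, \quad \forall n. \tag{8.81}$$

Furthermore, assume that the set of solutions $S$ is bounded; that is, $\forall \boldsymbol{\theta}, \mathbf{h} \in S$, there exists a bound $F$
such that

$$\|\boldsymbol{\theta} - \mathbf{h}\| \le F. \tag{8.82}$$

Let $\boldsymbol{\theta}_*$ be an optimal (desired) predictor. Then, if $\mu_n = \frac{1}{\sqrt{n}}$,

$$\frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_n(\boldsymbol{\theta}_{n-1}) \le \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_n(\boldsymbol{\theta}_*) + \frac{F^2}{2\sqrt{N}} + \frac{G^2}{\sqrt{N}}. \tag{8.83}$$

In words, as $N \to \infty$ the average cumulative loss tends to the average loss of the optimal predictor.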

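Proof. Since the adopted loss function is assumed to be convex and by the definition of the subgradient,
we have

$$\mathcal{L}_n(\mathbf{h}) \ge \mathcal{L}_n(\boldsymbol{\theta}_{n-1}) + \mathbf{g}_n^T(\mathbf{h} - \boldsymbol{\theta}_{n-1}), \quad \forall \mathbf{h} \in \mathbb{R}^l, \tag{8.84}$$

or

$$\mathcal{L}_n(\boldsymbol{\theta}_{n-1}) - \mathcal{L}_n(\mathbf{h}) \le \mathbf{g}_n^T(\boldsymbol{\theta}_{n-1} - \mathbf{h}). \tag{8.85}$$

However, recalling (8.80), we can write

$$\boldsymbol{\theta}_n - \mathbf{h} = \boldsymbol{\theta}_{n-1} - \mathbf{h} - \mu_n \mathbf{g}_n, \tag{8.86}$$

which results in

$$\|\boldsymbol{\theta}_n - \mathbf{h}\|^2 = \|\boldsymbol{\theta}_{n-1} - \mathbf{h}\|^2 + \mu_n^2 \|\mathbf{g}_n\|^2 - 2\mu_n \mathbf{g}_n^T(\boldsymbol{\theta}_{n-1} - \mathbf{h}). \tag{8.87}$$

Taking into account the bound of the subgradient, Eq. (8.87) leads to the inequality

$$\mathbf{g}_n^T(\boldsymbol{\theta}_{n-1} - \mathbf{h}) \le \frac{1}{2\mu_n}\left(\|\boldsymbol{\theta}_{n-1} - \mathbf{h}\|^2 - \|\boldsymbol{\theta}_n - \mathbf{h}\|^2\right) + \frac{\mu_n}{2} G^2. \tag{8.88}$$

Summing up both sides of (8.88), taking into account inequality (8.85) and after a bit of algebra
(Problem 8.30), results in

$$\sum_{n=1}^{N} \mathcal{L}_n(\boldsymbol{\theta}_{n-1}) - \sum_{n=1}^{N} \mathcal{L}_n(\mathbf{h}) \le \frac{1}{2\mu_N} F^2 + \frac{G^2}{2} \sum_{n=1}^{N} \mu_n. \tag{8.89}$$

Setting $\mu_n = \frac{1}{\sqrt{n}}$, using the obvious bound

$$\sum_{n=1}^{N} \frac{1}{\sqrt{n}} \le 1 + \int_{1}^{N} \frac{1}{\sqrt{t}}\, dt = 2\sqrt{N} - 1, \tag{8.90}$$

and dividing both sides of (8.89) by $N$, the proposition is proved for any $\mathbf{h}$. Hence, it will also be true
for $\boldsymbol{\theta}_*$. □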

The previous proof follows the one given in [113]; this was the first paper to adopt the notion of 
“regret” for the analysis of convex online algorithms. Proofs given later for more complex algorithms 
have borrowed, in one way or another, the arguments used there. 
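The bound in (8.89) can be checked numerically. The following Python sketch runs the online subgradient recursion (8.80) for a scalar parameter and the absolute-value loss $\mathcal{L}_n(\theta) = |y_n - \theta x_n|$, and compares the regret against the right-hand side of (8.89). The loss, the synthetic data, and the use of the empirically observed values of $F$ and $G$ along the run are assumptions made only for this illustration.

```python
import math
import random

def regret_vs_bound(samples, h, theta0=0.0):
    """Run (8.80) with mu_n = 1/sqrt(n); return (regret relative to h, bound (8.89))."""
    theta = theta0
    regret, step_sum = 0.0, 0.0
    F = abs(theta - h)   # empirical bound on |theta_n - h| along the run
    G = 0.0              # empirical bound on |g_n|
    for n, (y, x) in enumerate(samples, start=1):
        mu = 1.0 / math.sqrt(n)
        err = y - theta * x
        # Subgradient of |y - theta*x| with respect to theta (0 at the kink is valid).
        g = -x * ((err > 0) - (err < 0))
        # Loss of the learner (using theta_{n-1}) minus loss of the fixed predictor h.
        regret += abs(err) - abs(y - h * x)
        theta -= mu * g
        F = max(F, abs(theta - h))
        G = max(G, abs(g))
        step_sum += mu
    N = len(samples)
    # Right-hand side of (8.89); note that 1/(2*mu_N) = sqrt(N)/2.
    bound = F * F * math.sqrt(N) / 2.0 + G * G * step_sum / 2.0
    return regret, bound

rng = random.Random(1)
samples = []
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    samples.append((0.7 * x + rng.gauss(0.0, 0.1), x))
regret, bound = regret_vs_bound(samples, h=0.7)
```

Since the bound grows like $\sqrt{N}$, dividing both quantities by $N$ shows the average regret being forced toward zero, in line with (8.83).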

Remarks 8.8. 


• Tighter regret bounds can be derived when the loss function is strongly convex [43]. A function
$f : \mathcal{X} \subseteq \mathbb{R}^l \to \mathbb{R}$ is said to be $\sigma$-strongly convex if, $\forall \mathbf{y}, \mathbf{x} \in \mathcal{X}$,

$$f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{g}^T(\mathbf{y} - \mathbf{x}) + \frac{\sigma}{2}\|\mathbf{y} - \mathbf{x}\|^2, \tag{8.91}$$

for any subgradient $\mathbf{g}$ at $\mathbf{x}$. It also turns out that a function $f(\mathbf{x})$ is strongly convex if $f(\mathbf{x}) - \frac{\sigma}{2}\|\mathbf{x}\|^2$
is convex (Problem 8.31).

For $\sigma$-strongly convex loss functions, if the step size of the subgradient algorithm is diminishing
at a rate $O\left(\frac{1}{\sigma n}\right)$, then the average cumulative loss is approaching the average loss of the optimal
predictor at a rate $O\left(\frac{\ln N}{N}\right)$ (Problem 8.32). This is the case, for example, for the PEGASOS algorithm,
discussed in Section 8.10.3.

• In [4,5], $O(1/N)$ convergence rates are derived for a set of not strongly convex smooth loss
functions (squared error and logistic regression) even for the case of constant step sizes. The analysis
method follows statistical arguments.


8.12 ONLINE LEARNING AND BIG DATA APPLICATIONS: A DISCUSSION 

Online learning algorithms have been treated in Chapters 4, 5, and 6. The purpose of this section
is to first summarize some of the findings and at the same time present a discussion related to the
performance of online schemes compared to their batch relatives.

Recall that the ultimate goal in obtaining a parametric predictor,

$$\hat{y} = f_{\boldsymbol{\theta}}(\mathbf{x}),$$

is to select $\boldsymbol{\theta}$ so as to optimize the expected loss/risk function (8.65). For practical reasons, the
corresponding empirical formulation in (8.66) is most often adopted instead. From the learning theory's
point of view, this is justified provided the respective class of functions is sufficiently restrictive [94].
The available literature is quite rich in obtaining performance bounds that measure how close the
optimal value obtained via the expected loss is to that obtained via the empirical one, as a function of
the number of points $N$. Note that as $N \to \infty$, and recalling well-known arguments from probability
theory and statistics, the empirical risk tends to the expected loss (under general assumptions). Thus,
for very large training data sets, adopting the empirical risk may not be that different from using the
expected loss. However, for data sets of shorter lengths, a number of issues occur. Besides the value
of $N$, another critical factor enters the scene; this is the complexity of the family of the functions in
which we search for a solution. In other words, the generalization performance critically depends not only
on $N$ but also on how large or small this set of functions is. A related discussion for the specific case
of the MSE was presented in Chapter 3, in the context of the bias-variance tradeoff. The roots of the
more general theory go back to the pioneering work of Vapnik-Chervonenkis; see [31,95,96], and [88]
for a less mathematical summary of the major points.

In the sequel, some results tailored to the needs of our current discussion are presented.

APPROXIMATION, ESTIMATION, AND OPTIMIZATION ERRORS 

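Recall that all we are given in a machine learning task is the available training set of examples. To
set up the "game," the designer has to decide on the selection of (a) the loss function, $\mathcal{L}(\cdot, \cdot)$, which
measures the deviation (error) between predicted and target values, and (b) the set $\mathcal{F}$ of (parametric)
functions,

$$\mathcal{F} = \left\{f_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \mathbb{R}^K\right\}.$$

Based on the choice of $\mathcal{L}(\cdot, \cdot)$, the benchmark function, denoted as $f_*$, is the one that minimizes the
expected loss (see also Chapter 3), that is,

$$f_* = \arg\min_{f} \mathbb{E}\left[\mathcal{L}\left(y, f(\mathbf{x})\right)\right],$$

or equivalently

$$f_*(\mathbf{x}) = \arg\min_{\hat{y}} \mathbb{E}\left[\mathcal{L}(y, \hat{y}) \,|\, \mathbf{x}\right]. \tag{8.92}$$

Let also $f_{\boldsymbol{\theta}_*}$ denote the optimal function that results by minimizing the expected loss constrained within
the parametric family $\mathcal{F}$, that is,

$$f_{\boldsymbol{\theta}_*}: \quad \boldsymbol{\theta}_* = \arg\min_{\boldsymbol{\theta}} \mathbb{E}\left[\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\mathbf{x})\right)\right]. \tag{8.93}$$

However, instead of $f_{\boldsymbol{\theta}_*}$, we obtain another function, denoted as $f_N$, by minimizing the empirical risk,
$J_N(\boldsymbol{\theta})$,

$$f_N(\mathbf{x}) := f_{\boldsymbol{\theta}_*(N)}(\mathbf{x}): \quad \boldsymbol{\theta}_*(N) = \arg\min_{\boldsymbol{\theta}} J_N(\boldsymbol{\theta}). \tag{8.94}$$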
Once $f_N$ has been obtained, we are interested in evaluating its generalization performance; that is, to
compute the value of the expected loss at $f_N$, $\mathbb{E}\left[\mathcal{L}\left(y, f_N(\mathbf{x})\right)\right]$. The excess error with respect to the
optimal value can then be decomposed as [13]

$$\mathcal{E} = \mathbb{E}\left[\mathcal{L}\left(y, f_N(\mathbf{x})\right)\right] - \mathbb{E}\left[\mathcal{L}\left(y, f_*(\mathbf{x})\right)\right] = \mathcal{E}_{\text{appr}} + \mathcal{E}_{\text{est}}, \tag{8.95}$$

where

$$\mathcal{E}_{\text{appr}} := \mathbb{E}\left[\mathcal{L}\left(y, f_{\boldsymbol{\theta}_*}(\mathbf{x})\right)\right] - \mathbb{E}\left[\mathcal{L}\left(y, f_*(\mathbf{x})\right)\right]: \quad \text{approximation error},$$

and

$$\mathcal{E}_{\text{est}} := \mathbb{E}\left[\mathcal{L}\left(y, f_N(\mathbf{x})\right)\right] - \mathbb{E}\left[\mathcal{L}\left(y, f_{\boldsymbol{\theta}_*}(\mathbf{x})\right)\right]: \quad \text{estimation error}.$$

The former measures how well the chosen family of functions can perform compared to the optimal/benchmark
value and the latter measures the performance loss within the family $\mathcal{F}$, due to the fact that optimization
is performed via the empirical risk function. Large families of functions lead to low approximation error
but higher estimation error, and vice versa. A way to improve upon the estimation error, while keeping
the approximation error small, is to increase $N$. The size/complexity of the family $\mathcal{F}$ is measured by
its capacity, which may depend on the number of parameters, but this is not always the whole story;
see, for example, [88,95]. For example, the use of regularization, while minimizing the empirical risk,
can have a decisive effect on the approximation-estimation error tradeoff.

In practice, while optimizing the (regularized) empirical risk, one has to adopt an iterative
minimization or an online algorithm, which leads to an approximate solution, denoted as $\tilde{f}_N$. Then the
excess error in (8.95) involves a third term [13,14],

$$\mathcal{E} = \mathcal{E}_{\text{appr}} + \mathcal{E}_{\text{est}} + \mathcal{E}_{\text{opt}}, \tag{8.96}$$

where

$$\mathcal{E}_{\text{opt}} := \mathbb{E}\left[\mathcal{L}\left(y, \tilde{f}_N(\mathbf{x})\right)\right] - \mathbb{E}\left[\mathcal{L}\left(y, f_N(\mathbf{x})\right)\right]: \quad \text{optimization error}.$$

The literature is rich in studies deriving bounds concerning the excess error. More detailed treatment is
beyond the scope of this book. As a case study, we will follow the treatment given in [14].

Let the computation of $\tilde{f}_N$ be associated with a predefined accuracy,

$$\mathbb{E}\left[\mathcal{L}\left(y, \tilde{f}_N(\mathbf{x})\right)\right] < \mathbb{E}\left[\mathcal{L}\left(y, f_N(\mathbf{x})\right)\right] + \rho.$$

Then, for a class of functions that are often met in practice, for example, under strong convexity of the
loss function [51] or under certain assumptions on the data distribution [92], the following equivalence
relation can be established:

$$\mathcal{E} = \mathcal{E}_{\text{appr}} + \mathcal{E}_{\text{est}} + \mathcal{E}_{\text{opt}} \sim \mathcal{E}_{\text{appr}} + \left(\frac{\ln N}{N}\right)^{a} + \rho, \quad a \in \left[\frac{1}{2}, 1\right], \tag{8.97}$$

which verifies the fact that as $N \to \infty$ the estimation error decreases and provides a rule for the
respective convergence rate. Besides the approximation component, over which we
have no control (given the family of functions, $\mathcal{F}$), the excess error $\mathcal{E}$ depends on (a) the number of data
and (b) the accuracy, $\rho$, associated with the algorithm used. How one can control these parameters
depends on the type of learning task at hand.

• Small-scale tasks: These types of tasks are constrained by the number of training points N. In this 
case, one can reduce the optimization error, since computational load is not a problem, and achieve 
the minimum possible estimation error, as this is allowed by the number of available training points. 
In this case, one achieves the approximation-estimation tradeoff. 

• Large-scale/big data tasks: These types of tasks are constrained by the computational resources. 
Thus, a computationally cheap and less accurate algorithm may end up with lower excess error, 
since it has the luxury of exploiting more data, compared to a more accurate yet computationally 
more complex algorithm, given the maximum allowed computational load. 

BATCH VERSUS ONLINE LEARNING 

Our interest in this subsection lies in investigating whether there is a performance loss if in place of a
batch algorithm an online one is used. There is a very subtle issue involved here, which turns out to be
very important from a practical point of view. We will restrict our discussion to differentiable convex
loss functions.

Two major factors associated with the performance of an algorithm (in a stationary environment)
are its convergence rate and its accuracy after convergence. The general form of a batch algorithm in
minimizing (8.66) is written as

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \mu_i \Phi_i \nabla J_N\left(\boldsymbol{\theta}^{(i-1)}\right) = \boldsymbol{\theta}^{(i-1)} - \frac{\mu_i}{N} \Phi_i \sum_{n=1}^{N} \mathcal{L}'\left(\boldsymbol{\theta}^{(i-1)}, y_n, \mathbf{x}_n\right). \tag{8.98}$$

For gradient descent, $\Phi_i = I$, and for Newton-type recursions, $\Phi_i$ is the inverse Hessian matrix of the
loss function (Chapter 6).

Note that these are not the only possible choices for matrix $\Phi$. For example, in the
Levenberg-Marquardt method, the square Jacobian is employed, that is,

$$\Phi_i = \left[\nabla J\left(\boldsymbol{\theta}^{(i-1)}\right) \nabla^T J\left(\boldsymbol{\theta}^{(i-1)}\right) + \lambda I\right]^{-1},$$

where $\lambda$ is a regularization parameter. In [3], the natural gradient is proposed, which is based on the
Fisher information matrix associated with the noisy distribution implied by the adopted prediction
model. In both cases, the involved matrices asymptotically behave like the Hessian, yet they
may provide improved performance during the initial convergence phase. For a further discussion, the
interested reader may consult [49,56].

As already mentioned in Chapters 5 and 6 (Section 6.7), the convergence rate to the respective
optimal value of the simple gradient descent method is linear, that is,

$$\ln \frac{1}{\left\|\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*(N)\right\|^2} \propto i,$$

and the corresponding rate for a Newton-type algorithm is (approximately) quadratic, that is,

$$\ln \ln \frac{1}{\left\|\boldsymbol{\theta}^{(i)} - \boldsymbol{\theta}_*(N)\right\|^2} \propto i.$$

In contrast, the online version of (8.98), that is,

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mu_n \Phi_n \mathcal{L}'(\boldsymbol{\theta}_{n-1}, y_n, \mathbf{x}_n), \tag{8.99}$$

is based on a noisy estimate of the gradient, using the current sample point, $(y_n, \mathbf{x}_n)$, only. The effect
of this is to slow down convergence, in particular when the algorithm gets close to the solution.
Moreover, the estimate of the parameter vector fluctuates around the optimal value. We have extensively
studied this phenomenon in the case of the LMS, when $\mu_n$ is assigned a constant value. This is the
reason that in the stochastic gradient rationale, $\mu_n$ must be a decreasing sequence. However, it must
not decrease very fast, which is guaranteed by the condition $\sum_n \mu_n \to \infty$ (Section 5.4). Furthermore,
recall from our discussion there that the rate of convergence toward $\boldsymbol{\theta}_*$ is, on average, $O(1/n)$. This
result also covers the more general case of online algorithms given in (8.99) (see, for example, [56]).
Note, however, that all these results have been derived under a number of assumptions, for example,
that the algorithm is close enough to a solution.

Our major interest now turns to comparing the rate at which a batch and a corresponding online
algorithm converge to $\boldsymbol{\theta}_*$, that is, the value that minimizes the expected loss, which is the ultimate goal
of our learning task. Since the aim is to compare performances, given the same number of training
samples, let us use the same number, $n$, both for the online time index and in place of $N$ for the batch.
Following [12] and applying a second-order Taylor expansion on $J_n(\boldsymbol{\theta})$, it can be shown (Problem 8.28) that

$$\boldsymbol{\theta}_*(n) = \boldsymbol{\theta}_*(n-1) - \frac{1}{n} \Psi_n^{-1} \mathcal{L}'\left(\boldsymbol{\theta}_*(n-1), y_n, \mathbf{x}_n\right), \tag{8.100}$$

where

$$\Psi_n = \frac{1}{n} \sum_{k=1}^{n} \nabla^2 \mathcal{L}\left(\boldsymbol{\theta}_*(n-1), y_k, \mathbf{x}_k\right).$$

Note that (8.100) is similar in structure to (8.99). Also, as $n \to \infty$, $\Psi_n$ converges to the Hessian
matrix, $H$, of the expected loss function. Hence, for appropriate choices of the involved weighting
matrices and setting $\mu_n = 1/n$, (8.99) and (8.100) can converge to $\boldsymbol{\theta}_*$ at similar rates; thus, in both
cases, the critical factor that determines how close to the optimal $\boldsymbol{\theta}_*$ the resulting estimates are is the
number of data points used. It can be shown [12,56,93] that

$$\mathbb{E}\left[\|\boldsymbol{\theta}_n - \boldsymbol{\theta}_*\|^2\right] + o\left(\frac{1}{n}\right) = \mathbb{E}\left[\|\boldsymbol{\theta}_*(n) - \boldsymbol{\theta}_*\|^2\right] + o\left(\frac{1}{n}\right) = \frac{C}{n},$$

where $C$ is a constant depending on the specific form of the associated expected loss function used.
Thus, batch algorithms and their online versions can be made to converge at similar rates to $\boldsymbol{\theta}_*$
after appropriate fine-tuning of the involved parameters. Once more, since the critical factor in big
data applications is not data but computational resources, a cheap online algorithm can achieve
enhanced performance (lower excess error) compared to a batch scheme, which is computationally more
demanding. This is because for a given computational load, the online algorithm can process more data
points (Problem 8.33). More importantly, an online algorithm need not store the data, which can
be processed on the fly as they arrive. For a more detailed treatment of the topic, the interested reader
may consult [14].

In [14], two forms of batch linear support vector machines (Chapter 11) were tested against their
online stochastic gradient counterparts. The tests were carried out on the RCV1 database [53], and the
training set comprised 781,265 documents represented by (relatively) sparse feature vectors consisting
of 47,152 feature values. The stochastic gradient online versions, appropriately tuned with a diminishing
step size, achieved comparable error rates at substantially lower computational times (less than one
tenth) compared to their batch processing relatives.

Remarks 8.9. 

• Most of our discussion on the online versions has been focused on the simplest version, given in
(8.99) for $\Phi_n = I$. However, the topic of stochastic gradient descent schemes, especially in the
context of smooth loss functions, has a very rich history of over 60 years, and many algorithmic
variants have been "born." In Chapter 5, a number of variations of the basic LMS scheme were
discussed. Some more notable examples which are still popular are the following.

Stochastic gradient descent with momentum: The basic iteration of this variant is

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mu_n \mathcal{L}'_n(\boldsymbol{\theta}_{n-1}) + \beta_n (\boldsymbol{\theta}_{n-1} - \boldsymbol{\theta}_{n-2}). \tag{8.101}$$

Very often, $\beta_n = \beta$ is chosen to be a constant (see, for example, [91]).

Gradient averaging: Another widely used version results if the place of the single gradient is taken
by an average estimate, that is,

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \frac{\mu_n}{n} \sum_{k=1}^{n} \mathcal{L}'_k(\boldsymbol{\theta}_{n-1}). \tag{8.102}$$

Variants with different averaging scenarios (e.g., random selection instead of using all previous
points) are also around. Such an averaging has a smoothing effect on the convergence of the
algorithm. We have already seen this rationale in the context of the PEGASOS algorithm
(Section 8.10.3). The general trend of all the variants of the basic stochastic gradient scheme is to
improve with respect to the involved constants, but the convergence rate still remains $O(1/n)$.

In [50], the online learning rationale was used in the context of data sets of fixed size, $N$. Instead
of using the gradient descent scheme in (8.98), the following version is proposed:

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} - \frac{\mu_i}{N} \sum_{k=1}^{N} \mathbf{g}_k^{(i)}, \tag{8.103}$$

where

$$\mathbf{g}_k^{(i)} = \begin{cases} \mathcal{L}'_k\left(\boldsymbol{\theta}^{(i-1)}\right), & \text{if } k = i_k, \\ \mathbf{g}_k^{(i-1)}, & \text{otherwise}. \end{cases} \tag{8.104}$$

The index $i_k$ is randomly chosen every time from $\{1, 2, \ldots, N\}$. Thus, in each iteration only one
gradient is computed and the rest are drawn from the memory. It turns out that for strongly convex
smooth loss functions, the algorithm exhibits linear convergence to the solution of the empirical
risk in (8.66). Of course, compared to the basic online schemes, an $O(N)$ memory is required for
keeping track of the gradient computations.

• The literature on deriving performance bounds concerning online algorithms is very rich, both in
studies and in ideas. For example, another line of research involves bounds for arbitrary online
algorithms (see [1,19,69] and references therein).
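The stored-gradient scheme of (8.103) and (8.104) can be sketched in Python for a scalar least-squares problem. This is an illustrative sketch only: the loss $\mathcal{L}_k(\theta) = \frac{1}{2}(y_k - x_k\theta)^2$, the zero initialization of the gradient memory, the constant step size, and the function name `sag_least_squares` are assumptions made here and are not prescribed by the text.

```python
import random

def sag_least_squares(xs, ys, mu, n_iter, seed=0):
    """Stored-gradient averaging, (8.103)-(8.104), for a scalar parameter."""
    rng = random.Random(seed)
    N = len(xs)
    stored = [0.0] * N   # memory of the gradients g_k (assumed to start at zero)
    total = 0.0          # running sum of the stored gradients
    theta = 0.0
    for _ in range(n_iter):
        k = rng.randrange(N)                       # randomly chosen index
        g_new = -(ys[k] - xs[k] * theta) * xs[k]   # fresh gradient of L_k at theta
        total += g_new - stored[k]                 # refresh only the k-th entry
        stored[k] = g_new
        theta -= mu / N * total                    # step along the averaged memory
    return theta

# Toy problem: the empirical risk minimizer is the mean of ys, i.e., 2.0.
theta_sag = sag_least_squares([1.0, 1.0, 1.0], [1.0, 2.0, 3.0],
                              mu=1.0 / 16, n_iter=3000)
```

Only one gradient is evaluated per iteration, while the running sum keeps the cost of forming the average at $O(1)$; the price is the $O(N)$ memory for the stored gradients.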


8.13 PROXIMAL OPERATORS

So far in the chapter, we have dealt with the notion of the projection operator. In this section, we go
one step further and introduce an elegant generalization of the notion of projection. Note that
when we refer to an operator, we mean a mapping from $\mathbb{R}^l \to \mathbb{R}^l$, in contrast to a function, which is
a mapping $\mathbb{R}^l \to \mathbb{R}$.


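Definition 8.9. Let

$$f : \mathbb{R}^l \to \mathbb{R}$$

be a convex function and $\lambda > 0$. The corresponding proximal or proximity operator of index $\lambda$ [60,71],

$$\text{Prox}_{\lambda f} : \mathbb{R}^l \to \mathbb{R}^l, \tag{8.105}$$

is defined as

$$\text{Prox}_{\lambda f}(\mathbf{x}) := \arg\min_{\mathbf{v} \in \mathbb{R}^l} \left\{ f(\mathbf{v}) + \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|^2 \right\}: \quad \text{proximal operator}. \tag{8.106}$$

We stress that $\text{Prox}_{\lambda f}(\mathbf{x})$ is a point in $\mathbb{R}^l$. The definition can also be extended to include
functions defined as $f : \mathbb{R}^l \to \mathbb{R} \cup \{+\infty\}$. A closely related notion to the proximal operator is the
following.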

Definition 8.10. Let $f$ be a convex function as before. We call the Moreau envelope the function

$$e_{\lambda f}(\mathbf{x}) := \min_{\mathbf{v} \in \mathbb{R}^l} \left\{ f(\mathbf{v}) + \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|^2 \right\}: \quad \text{Moreau envelope}. \tag{8.107}$$

Note that the Moreau envelope [59] is a function related to the proximal operator as

$$e_{\lambda f}(\mathbf{x}) = f\left(\text{Prox}_{\lambda f}(\mathbf{x})\right) + \frac{1}{2\lambda}\left\|\mathbf{x} - \text{Prox}_{\lambda f}(\mathbf{x})\right\|^2. \tag{8.108}$$

The Moreau envelope can also be thought of as a regularized minimization, and it is also known as the
Moreau-Yosida regularization [109]. Moreover, it can be shown that it is differentiable (e.g., [7]).

A first point to clarify is whether the minimum in (8.106) exists. Note that the two terms in the
brackets are both convex; namely, $f(\mathbf{v})$ and the quadratic term $\|\mathbf{x} - \mathbf{v}\|^2$. Hence, as can easily be
shown by recalling the definition of convexity, their sum is also convex. Moreover, the latter of the two
terms is strictly convex, hence their sum is also strictly convex, which guarantees a unique minimum.

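Example 8.9. Let us calculate $\text{Prox}_{\lambda \iota_C}$, where $\iota_C : \mathbb{R}^l \to \mathbb{R} \cup \{+\infty\}$ stands for the indicator function
of a nonempty closed convex subset $C \subset \mathbb{R}^l$, defined as

$$\iota_C(\mathbf{x}) := \begin{cases} 0, & \text{if } \mathbf{x} \in C, \\ +\infty, & \text{if } \mathbf{x} \notin C. \end{cases}$$

It is not difficult to verify that

$$\text{Prox}_{\lambda \iota_C}(\mathbf{x}) = \arg\min_{\mathbf{v} \in \mathbb{R}^l} \left\{ \iota_C(\mathbf{v}) + \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|^2 \right\} = \arg\min_{\mathbf{v} \in C} \|\mathbf{x} - \mathbf{v}\|^2 = P_C(\mathbf{x}), \quad \forall \mathbf{x} \in \mathbb{R}^l, \ \forall \lambda > 0,$$

where $P_C$ is the (metric) projection mapping onto $C$.
Moreover,

$$e_{\lambda \iota_C}(\mathbf{x}) = \min_{\mathbf{v} \in \mathbb{R}^l} \left\{ \iota_C(\mathbf{v}) + \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|^2 \right\} = \min_{\mathbf{v} \in C} \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|^2 = \frac{1}{2\lambda} d_C^2(\mathbf{x}),$$

where $d_C$ stands for the (metric) distance function to $C$ (Example 8.5), defined as
$d_C(\mathbf{x}) := \min_{\mathbf{v} \in C} \|\mathbf{x} - \mathbf{v}\|$.

Thus, as said in the beginning of this section, the proximal operator can be considered as the
generalization of the projection one.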
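As a concrete instance of Example 8.9, the following Python sketch takes $C$ to be the closed $\ell_2$ ball of a given radius centered at the origin, for which the projection has a simple closed form. The choice of set and the function names are assumptions made for illustration.

```python
import math

def project_l2_ball(x, radius):
    """Metric projection of x onto the closed ball B[0, radius]."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)                      # x already lies in C
    return [xi * radius / norm for xi in x]

def moreau_env_indicator(x, radius, lam):
    """Moreau envelope of the ball's indicator: squared distance to C over 2*lam."""
    p = project_l2_ball(x, radius)
    dist_sq = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    return dist_sq / (2.0 * lam)

p = project_l2_ball([3.0, 4.0], 1.0)              # point of the unit ball closest to x
e = moreau_env_indicator([3.0, 4.0], 1.0, 2.0)    # distance 4, so 16 / (2 * 2)
```

Note that the proximal point does not depend on $\lambda$, whereas the value of the envelope does.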

Example 8.10. In the case where $f$ becomes the $\ell_1$ norm of a vector, that is,

$$\|x\|_1 = \sum_{i=1}^{l}|x_i|, \quad \forall x\in\mathbb{R}^l,$$

it is easily determined that (8.106) decomposes into a set of $l$ scalar minimization tasks, that is,

$$\mathrm{Prox}_{\lambda\|\cdot\|_1}(x)\big|_i = \arg\min_{v_i\in\mathbb{R}}\Big\{|v_i| + \frac{1}{2\lambda}(x_i - v_i)^2\Big\}, \quad i = 1, 2, \ldots, l, \tag{8.109}$$

where $\mathrm{Prox}_{\lambda\|\cdot\|_1}(x)|_i$ denotes the respective $i$th element. Minimizing (8.109) is equivalent to requiring
that zero belongs to the subdifferential, which results in

$$\mathrm{Prox}_{\lambda\|\cdot\|_1}(x)\big|_i = \begin{cases} x_i - \mathrm{sgn}(x_i)\lambda, & \text{if } |x_i| > \lambda, \\ 0, & \text{if } |x_i| \le \lambda \end{cases} \;=\; \mathrm{sgn}(x_i)\max\{0, |x_i| - \lambda\}. \tag{8.110}$$

For the time being, the proof is left as an exercise. The same task is treated in detail in Chapter 9, and
the proof is provided in Section 9.3. The operation in (8.110) is also known as soft thresholding. In
other words, it sets to zero all values with magnitude less than or equal to a threshold value ($\lambda$) and adds a constant
bias (depending on the sign) to the rest. To provoke the unfamiliar reader a bit, this is a way to impose
sparsity on a parameter vector.
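
The soft thresholding rule in (8.110) can be sketched in a few lines. This is only an illustrative implementation of the piecewise form; the function name and the sample vector are assumptions made for the example.

```python
def soft_threshold(x, lam):
    """Componentwise soft thresholding: sgn(x_i) * max(0, |x_i| - lam)."""
    out = []
    for xi in x:
        if xi > lam:
            out.append(xi - lam)   # shrink positive values toward zero
        elif xi < -lam:
            out.append(xi + lam)   # shrink negative values toward zero
        else:
            out.append(0.0)        # small values are set exactly to zero
    return out


shrunk = soft_threshold([3.0, -0.2, 0.5, -1.5], lam=0.5)
print(shrunk)
```

Components whose magnitude does not exceed the threshold are zeroed, which is the sparsity-promoting effect mentioned above.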

Having calculated $\mathrm{Prox}_{\lambda\|\cdot\|_1}(x)$, the Moreau envelope of $\|\cdot\|_1$ can be directly obtained by

$$e_{\lambda\|\cdot\|_1}(x) = \sum_{i=1}^{l}\left(\frac{1}{2\lambda}\Big(x_i - \mathrm{Prox}_{\lambda\|\cdot\|_1}(x)\big|_i\Big)^2 + \Big|\mathrm{Prox}_{\lambda\|\cdot\|_1}(x)\big|_i\Big|\right)$$

$$= \sum_{i=1}^{l}\left(\chi_{[0,\lambda]}(|x_i|)\frac{x_i^2}{2\lambda} + \chi_{(\lambda,+\infty)}(|x_i|)\Big(\big|x_i - \mathrm{sgn}(x_i)\lambda\big| + \frac{\lambda}{2}\Big)\right)$$

$$= \sum_{i=1}^{l}\left(\chi_{[0,\lambda]}(|x_i|)\frac{x_i^2}{2\lambda} + \chi_{(\lambda,+\infty)}(|x_i|)\Big(|x_i| - \frac{\lambda}{2}\Big)\right),$$

where $\chi_A(\cdot)$ denotes the characteristic function of the set $A$, defined in (8.59). For the one-dimensional
case, $l = 1$, the previous Moreau envelope boils down to

$$e_{\lambda|\cdot|}(x) = \begin{cases} |x| - \dfrac{\lambda}{2}, & \text{if } |x| > \lambda, \\[4pt] \dfrac{x^2}{2\lambda}, & \text{if } |x| \le \lambda. \end{cases}$$

This envelope and the original $|\cdot|$ function are depicted in Fig. 8.30. It is worth noting here that $e_{\lambda|\cdot|}$
is a scaled version, more accurately $1/\lambda$ times, of the celebrated Huber function; a loss function widely
used against outliers in robust statistics, which will be discussed in more detail in Chapter 11. Note
that the Moreau envelope is a "blown-up" smoothed version of the $\ell_1$ norm function and although the
original function is not differentiable, its Moreau envelope is continuously differentiable; moreover,
they both share the same minimum. This is most interesting and we will come back to this very soon.
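
The closed form of the one-dimensional envelope and its relation to the Huber function can be checked numerically. The sketch below is only illustrative: the Huber function is written in its common form with threshold `delta` (an assumption about the parametrization), and the brute-force grid search merely mimics the minimization in the definition of the envelope.

```python
def moreau_abs(x, lam):
    """Closed-form Moreau envelope of |.| with index lam."""
    if abs(x) <= lam:
        return x * x / (2.0 * lam)
    return abs(x) - lam / 2.0


def huber(x, delta):
    """Huber function in its common form (threshold delta); assumed parametrization."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)


def moreau_abs_numeric(x, lam, half_width=5.0, step=1e-3):
    """Brute-force minimization of |v| + (x - v)^2 / (2 lam) over a grid of v."""
    n = int(2 * half_width / step) + 1
    grid = (-half_width + k * step for k in range(n))
    return min(abs(v) + (x - v) ** 2 / (2.0 * lam) for v in grid)


lam = 0.5
points = [-2.0, -0.3, 0.0, 0.4, 1.7]
closed = [moreau_abs(x, lam) for x in points]
scaled_huber = [huber(x, lam) / lam for x in points]
numeric = [moreau_abs_numeric(x, lam) for x in points]
print(closed)
```

The three lists agree up to the grid resolution, which is consistent with the envelope being $1/\lambda$ times the Huber function.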


8.13.1 PROPERTIES OF THE PROXIMAL OPERATOR

We now focus on some basic properties of the proximal operator, which will soon be used to give birth 
to a new class of algorithms for the minimization of nonsmooth convex loss functions. 

Proposition 8.5. Consider a convex function

$$f : \mathbb{R}^l \to \mathbb{R}\cup\{+\infty\},$$

and let $\mathrm{Prox}_{\lambda f}(\cdot)$ be its corresponding proximal operator of index $\lambda$. Then

$$p = \mathrm{Prox}_{\lambda f}(x)$$

if and only if

$$\langle y - p, x - p\rangle \le \lambda\big(f(y) - f(p)\big), \quad \forall y\in\mathbb{R}^l. \tag{8.111}$$

Another necessary condition is

$$\big\|\mathrm{Prox}_{\lambda f}(x) - \mathrm{Prox}_{\lambda f}(y)\big\|^2 \le \big\langle x - y, \mathrm{Prox}_{\lambda f}(x) - \mathrm{Prox}_{\lambda f}(y)\big\rangle. \tag{8.112}$$

The proofs of (8.111) and (8.112) are given in Problems 8.34 and 8.35, respectively. Note that
(8.112) is of the same flavor as (8.16), a property that the proximal operator inherits from its more
primitive ancestor, the projection operator. In the sequel, we will make use of these properties to touch upon the algorithmic
front, where our main interest lies.

FIGURE 8.30

The $|x|$ function (black solid line), its Moreau envelope $e_{\lambda|\cdot|}(x)$ (red solid line), and $x^2/2$ (black dotted line), for
$x \in \mathbb{R}$. Even if $|\cdot|$ is nondifferentiable at 0, $e_{\lambda|\cdot|}(x)$ is everywhere differentiable. Note also that although $x^2/2$ and
$e_{\lambda|\cdot|}(x)$ behave exactly the same for small values of $x$, $e_{\lambda|\cdot|}(x)$ is more conservative than $x^2/2$ in penalizing large
values of $x$; this is the reason for the extensive use of the Huber function, a scaled version of $e_{\lambda|\cdot|}(x)$, as
a robust tool against outliers in robust statistics.


Lemma 8.3. Consider the convex function

$$f : \mathbb{R}^l \to \mathbb{R}\cup\{+\infty\}$$

and its proximal operator $\mathrm{Prox}_{\lambda f}$ of index $\lambda$. Then the fixed point set of the proximal operator coincides
with the set of minimizers of $f$, that is,

$$\mathrm{Fix}(\mathrm{Prox}_{\lambda f}) = \Big\{x : x = \arg\min_{y} f(y)\Big\}. \tag{8.113}$$

Proof. The definition of the fixed point set has been given in Section 8.3.1. We first assume that a point
$x$ belongs to the fixed point set, hence the action of the proximal operator leaves it unaffected, that is,

$$x = \mathrm{Prox}_{\lambda f}(x),$$

and making use of (8.111), we get

$$\langle y - x, x - x\rangle \le \lambda\big(f(y) - f(x)\big), \quad \forall y\in\mathbb{R}^l, \tag{8.114}$$

which results in

$$f(x) \le f(y), \quad \forall y\in\mathbb{R}^l. \tag{8.115}$$

That is, $x$ is a minimizer of $f$. For the converse, we assume that $x$ is a minimizer. Then (8.115) is valid,
from which (8.114) is deduced, and since this is a necessary and sufficient condition for a point to be
equal to the value of the proximal operator, we have proved the claim. □

Finally, note that $f$ and $e_{\lambda f}$ share the same set of minimizers. Thus, one can minimize a nonsmooth
convex function by dealing with an equivalent smooth one. From a practical point of view, the value of
the method depends on how easy it is to obtain the proximal operator. For example, we have already
seen that if the goal is to minimize the $\ell_1$ norm, the proximal operator is a simple soft thresholding
operation. Needless to say, life is not always that generous!


8.13.2 PROXIMAL MINIMIZATION 

In this section, we will exploit our experience from Section 8.4 to develop iterative schemes which
asymptotically land their estimates in the fixed point set of the respective operator. All that is required
is for the operator to possess a nonexpansiveness property.

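Proposition 8.6. The proximal operator associated with a convex function is nonexpansive, that is,

$$\big\|\mathrm{Prox}_{\lambda f}(x) - \mathrm{Prox}_{\lambda f}(y)\big\| \le \|x - y\|. \tag{8.116}$$

Proof. The proof is readily obtained as a combination of the property in (8.112) with the Cauchy-
Schwarz inequality: the inner product on the right-hand side of (8.112) is bounded by
$\|x - y\|\,\|\mathrm{Prox}_{\lambda f}(x) - \mathrm{Prox}_{\lambda f}(y)\|$, and dividing both sides by $\|\mathrm{Prox}_{\lambda f}(x) - \mathrm{Prox}_{\lambda f}(y)\|$
(when it is nonzero) gives (8.116). Moreover, it can be shown that the relaxed version of the proximal operator (also
known as the reflected version),

$$R_{\lambda f} := 2\,\mathrm{Prox}_{\lambda f} - I, \tag{8.117}$$

is also nonexpansive with the same fixed point set as that of the proximal operator (Problem 8.36). □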


Proposition 8.7. Let

$$f : \mathbb{R}^l \to \mathbb{R}\cup\{+\infty\}$$

be a convex function, with $\mathrm{Prox}_{\lambda f}$ being the respective proximal operator of index $\lambda$. Then, starting
from an arbitrary point, $x_0 \in \mathbb{R}^l$, the iterative algorithm

$$x_k = x_{k-1} + \mu_k\big(\mathrm{Prox}_{\lambda f}(x_{k-1}) - x_{k-1}\big), \tag{8.118}$$

where $\mu_k \in (0, 2)$ is such that

$$\sum_{k=1}^{\infty}\mu_k(2 - \mu_k) = +\infty,$$

converges to an element of the fixed point set of the proximal operator; that is, it converges to a mini-
mizer of $f$. Proximal minimization algorithms are traced back to the early 1970s [57,72].

The proof of the proposition is given in Problem 8.36 [84]. Observe that (8.118) is the counterpart
of (8.26). A special case occurs if $\mu_k = 1$, which results in

$$x_k = \mathrm{Prox}_{\lambda f}(x_{k-1}), \tag{8.119}$$

also known as the proximal point algorithm.
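
To make the recursion (8.118) concrete, here is a minimal sketch for the scalar function $f(x) = |x|$, whose proximal operator is soft thresholding. The constant relaxation parameter $\mu_k = 1.5$, the index $\lambda = 0.5$, and the starting point are arbitrary choices for illustration; a constant $\mu_k$ in $(0, 2)$ satisfies the summability condition of Proposition 8.7.

```python
def prox_abs(x, lam):
    """Proximal operator of |.| with index lam (scalar soft thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def relaxed_proximal_iteration(x0, lam, mu, num_iters):
    """Run x_k = x_{k-1} + mu * (Prox(x_{k-1}) - x_{k-1}) and record the iterates."""
    x = x0
    history = [x]
    for _ in range(num_iters):
        x = x + mu * (prox_abs(x, lam) - x)
        history.append(x)
    return history


path = relaxed_proximal_iteration(x0=4.0, lam=0.5, mu=1.5, num_iters=60)
print(path[:7], path[-1])
```

The iterates first move in steps of size $\mu\lambda$ and then contract toward 0, the unique minimizer of $|x|$. Setting `mu=1.0` gives the proximal point algorithm (8.119).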

Example 8.11. Let us demonstrate the previous findings via the familiar optimization task of the
quadratic function

$$f(x) = \frac{1}{2}x^T A x - b^T x.$$

It does not take long to see that the minimizer occurs at the solution of the linear system of equations

$$A x_* = b.$$

From the definition in (8.106), taking the gradient of the quadratic function and equating to zero, we
readily obtain

$$\mathrm{Prox}_{\lambda f}(x) = \Big(A + \frac{1}{\lambda}I\Big)^{-1}\Big(b + \frac{1}{\lambda}x\Big), \tag{8.120}$$

and setting $\epsilon = \frac{1}{\lambda}$, the recursion in (8.119) becomes

$$x_k = (A + \epsilon I)^{-1}(b + \epsilon x_{k-1}). \tag{8.121}$$

After some simple algebraic manipulations (Problem 8.37), we finally obtain

$$x_k = x_{k-1} + (A + \epsilon I)^{-1}(b - A x_{k-1}). \tag{8.122}$$

This scheme is known in numerical linear algebra as the iterative refinement algorithm [58].
It is used when the matrix $A$ is near-singular, so the regularization via $\epsilon$ helps the inversion. Note
that at each iteration, $b - A x_{k-1}$ is the error committed by the current estimate. The algorithm belongs
to a larger family of algorithms, known as stationary iterative or iterative relaxation schemes; we will
meet such schemes in Chapter 10.
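
The recursion (8.122) can be tried out on a tiny system. The sketch below is restricted to the $2\times 2$ case so that the shifted system can be solved by Cramer's rule without external libraries; the matrix, the right-hand side, and the deliberately large value $\epsilon = 10$ are arbitrary choices for illustration.

```python
def solve_2x2(m, r):
    """Solve the 2x2 linear system m d = r by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def iterative_refinement(A, b, eps, num_iters):
    """x_k = x_{k-1} + (A + eps I)^{-1} (b - A x_{k-1}), starting from x_0 = 0."""
    x = [0.0, 0.0]
    shifted = [[A[0][0] + eps, A[0][1]], [A[1][0], A[1][1] + eps]]
    for _ in range(num_iters):
        # residual b - A x of the current estimate
        residual = [b[i] - (A[i][0] * x[0] + A[i][1] * x[1]) for i in range(2)]
        step = solve_2x2(shifted, residual)
        x = [x[0] + step[0], x[1] + step[1]]
    return x


A = [[2.0, 1.0], [1.0, 2.0]]   # symmetric positive definite
b = [3.0, 3.0]                 # exact solution of A x = b is [1, 1]
x_hat = iterative_refinement(A, b, eps=10.0, num_iters=400)
print(x_hat)
```

A larger $\epsilon$ slows the iteration down, since each step then contracts the error less, but the estimates still approach the solution of $Ax = b$.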

The interesting point here is that since the algorithm results as a special case of the proximal mini-
mization algorithm, convergence to the solution is guaranteed even if $\epsilon$ is not small!

Resolvent of the Subdifferential Mapping

We will look at the proximal operator from a slightly different view, which will be useful to us soon,
when it will be used for the solution of more general minimization tasks. We will follow a more
descriptive and less mathematically formal path.

According to Lemma 8.2 and since the proximal operator is a minimizer of (8.106), it must be
chosen so that

$$0 \in \partial f(v) + \frac{1}{\lambda}v - \frac{1}{\lambda}x, \tag{8.123}$$

or

$$0 \in \lambda\partial f(v) + v - x, \tag{8.124}$$

or

$$x \in \lambda\partial f(v) + v. \tag{8.125}$$

Let us now define the mapping

$$(I + \lambda\partial f) : \mathbb{R}^l \to \mathbb{R}^l, \tag{8.126}$$

such that

$$(I + \lambda\partial f)(v) = v + \lambda\partial f(v). \tag{8.127}$$

Note that this mapping is one-to-many (a point-to-set mapping is also called a relation on $\mathbb{R}^l$), due to the definition of the subdifferential, which is a set.
However, its inverse mapping, denoted as

$$(I + \lambda\partial f)^{-1} : \mathbb{R}^l \to \mathbb{R}^l, \tag{8.128}$$

is single-valued, and as a matter of fact it coincides with the proximal operator; this is readily deduced
from (8.125), which can equivalently be written as

$$x \in (I + \lambda\partial f)(v),$$

which implies that

$$(I + \lambda\partial f)^{-1}(x) = v = \mathrm{Prox}_{\lambda f}(x). \tag{8.129}$$

Single-valuedness follows because, as we know, the proximal operator is unique. The operator in (8.128) is known as the
resolvent of the subdifferential mapping [72].

As an exercise, let us now apply (8.129) to the case of Example 8.11. For this case, the subdiffer-
ential set is a singleton comprising the gradient vector,

$$\mathrm{Prox}_{\lambda f}(x) = (I + \lambda\nabla f)^{-1}(x) \;\Longrightarrow\; (I + \lambda\nabla f)\big(\mathrm{Prox}_{\lambda f}(x)\big) = x,$$

or by definition of the mapping $(I + \lambda\nabla f)$ and taking the gradient of the quadratic function,

$$\mathrm{Prox}_{\lambda f}(x) + \lambda\nabla f\big(\mathrm{Prox}_{\lambda f}(x)\big) = \mathrm{Prox}_{\lambda f}(x) + \lambda A\,\mathrm{Prox}_{\lambda f}(x) - \lambda b = x,$$

which finally results in

$$\mathrm{Prox}_{\lambda f}(x) = \Big(A + \frac{1}{\lambda}I\Big)^{-1}\Big(b + \frac{1}{\lambda}x\Big).$$



8.14 PROXIMAL SPLITTING METHODS FOR OPTIMIZATION 

A number of optimization tasks come in the form of a summation of individual convex functions,
some of them being differentiable and some of them nonsmooth. Sparsity-aware learning tasks are typ-
ical examples that have received a lot of attention recently, where the regularizing term is nonsmooth,
for example, the $\ell_1$ norm.

Our goal in this section is to solve the following minimization task:

$$x_* = \arg\min_{x}\{f(x) + g(x)\}, \tag{8.130}$$

where both involved functions are convex,

$$f : \mathbb{R}^l \to \mathbb{R}\cup\{+\infty\}, \quad g : \mathbb{R}^l \to \mathbb{R},$$

and $g$ is assumed to be differentiable while $f$ is nonsmooth. It turns out that the iterative scheme

$$x_k = \underbrace{\mathrm{Prox}_{\lambda_k f}}_{\text{backward step}}\big(\underbrace{x_{k-1} - \lambda_k\nabla g(x_{k-1})}_{\text{forward step}}\big) \tag{8.131}$$

converges to a minimizer of the sum of the involved functions, that is,

$$x_k \to \arg\min_{x}\{f(x) + g(x)\}, \tag{8.132}$$

for a properly chosen sequence, $\lambda_k$, and provided that the gradient is Lipschitz continuous, that is,

$$\|\nabla g(x) - \nabla g(y)\| \le \gamma\|x - y\|, \tag{8.133}$$

for some $\gamma > 0$. It can be shown that if $\lambda_k \in \big(0, \frac{1}{\gamma}\big]$, then the algorithm converges to a minimizer at
a sublinear rate, $O(1/k)$ [8]. This family of algorithms is known as proximal gradient or forward-
backward splitting algorithms. The term splitting is inherited from the split of the function into two (or
more generally into more) parts. The term proximal indicates the presence of the proximal operator of
$f$ in the optimization scheme. The iteration involves an (explicit) forward gradient computation step
performed on the smooth part and an (implicit) backward step via the use of the proximal operator
of the nonsmooth part. The terms forward-backward are borrowed from numerical analysis methods
involving discretization techniques [97]. Proximal gradient schemes are traced back to, for example,
[18,54], but their spread in machine learning and signal processing matured later [26,36].
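
The forward-backward iteration (8.131) can be sketched for a small sparsity-promoting least-squares task, with $g(x) = \frac{1}{2}\|Hx - y\|^2$ and $f(x) = \tau\|x\|_1$. Everything specific here is an assumption made for illustration: the names `H`, `y`, `tau`, the constant step size, and the use of $\mathrm{trace}(H^TH)$ as a (conservative) upper bound on the Lipschitz constant $\gamma$ of $\nabla g$. Note that the proximal operator of $\lambda\tau\|\cdot\|_1$ is soft thresholding with threshold $\lambda\tau$.

```python
def soft_threshold(v, t):
    """Componentwise soft thresholding with threshold t."""
    out = []
    for vi in v:
        if vi > t:
            out.append(vi - t)
        elif vi < -t:
            out.append(vi + t)
        else:
            out.append(0.0)
    return out


def proximal_gradient_l1(H, y, tau, num_iters):
    """Minimize tau * ||x||_1 + 0.5 * ||H x - y||^2 by forward-backward splitting."""
    rows, cols = len(H), len(H[0])
    # trace(H^T H) upper-bounds the largest eigenvalue of H^T H, i.e., gamma
    gamma_bound = sum(h * h for row in H for h in row)
    lam = 1.0 / gamma_bound            # constant step size inside (0, 1/gamma]
    x = [0.0] * cols
    for _ in range(num_iters):
        residual = [sum(H[i][j] * x[j] for j in range(cols)) - y[i] for i in range(rows)]
        grad = [sum(H[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        forward = [x[j] - lam * grad[j] for j in range(cols)]   # forward (gradient) step
        x = soft_threshold(forward, lam * tau)                  # backward (proximal) step
    return x


H = [[2.0, 0.0], [0.0, 1.0]]
y = [4.0, 0.3]
x_hat = proximal_gradient_l1(H, y, tau=0.5, num_iters=100)
print(x_hat)
```

For this diagonal `H` the task separates per coordinate, so the minimizer can be verified by hand: the first component is $(8 - 0.5)/4 = 1.875$ and the second is thresholded to zero.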

There are a number of variants of the previous basic scheme. A version that achieves an $O(1/k^2)$
rate of convergence is based on the classical Nesterov modification of the gradient algorithm [65],
and it is summarized in Algorithm 8.5 [8]. In the algorithm, the update is split into two parts. In the
proximal operator, one uses a smoother version of the obtained estimates, using an averaging that
involves previous estimates.

Algorithm 8.5 (Fast proximal gradient splitting algorithm).

• Initialization

- Select $x_0$, $z_1 = x_0$, $t_1 = 1$.

- Select $\lambda$.

• For $k = 1, 2, \ldots$, Do

- $y_k = z_k - \lambda\nabla g(z_k)$

- $x_k = \mathrm{Prox}_{\lambda f}(y_k)$

- $t_{k+1} = \dfrac{1 + \sqrt{4t_k^2 + 1}}{2}$

- $\mu_k = \dfrac{t_k - 1}{t_{k+1}}$

- $z_{k+1} = x_k + \mu_k(x_k - x_{k-1})$

• End For

Note that the algorithm involves a step size $\mu_k$. The computation of the variables $t_k$ is done in such
a way that convergence speed is optimized. However, it has to be noted that convergence of the
scheme is no longer guaranteed, in general.
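
A possible rendering of the fast scheme is sketched below, again for $g(x) = \frac{1}{2}\|Hx - y\|^2$ and $f(x) = \tau\|x\|_1$. The extrapolation weights follow the standard accelerated proximal gradient recursion of [8] ($t_{k+1} = (1 + \sqrt{4t_k^2 + 1})/2$, weight $(t_k - 1)/t_{k+1}$), which is assumed here to be what the algorithm above intends; the problem data and the step size are hypothetical.

```python
import math


def soft_threshold(v, t):
    """Componentwise soft thresholding with threshold t."""
    return [vi - t if vi > t else vi + t if vi < -t else 0.0 for vi in v]


def grad_least_squares(H, y, z):
    """Gradient of 0.5 * ||H z - y||^2, i.e., H^T (H z - y)."""
    rows, cols = len(H), len(H[0])
    residual = [sum(H[i][j] * z[j] for j in range(cols)) - y[i] for i in range(rows)]
    return [sum(H[i][j] * residual[i] for i in range(rows)) for j in range(cols)]


def fast_proximal_gradient(H, y, tau, lam, num_iters):
    """Accelerated forward-backward splitting for tau * ||x||_1 + 0.5 * ||H x - y||^2."""
    cols = len(H[0])
    x_prev = [0.0] * cols      # x_0
    z = list(x_prev)           # z_1 = x_0
    t = 1.0                    # t_1 = 1
    for _ in range(num_iters):
        grad = grad_least_squares(H, y, z)
        forward = [z[j] - lam * grad[j] for j in range(cols)]
        x = soft_threshold(forward, lam * tau)
        t_next = (1.0 + math.sqrt(4.0 * t * t + 1.0)) / 2.0
        mu = (t - 1.0) / t_next
        # extrapolate using the two most recent estimates
        z = [x[j] + mu * (x[j] - x_prev[j]) for j in range(cols)]
        x_prev, t = x, t_next
    return x_prev


H = [[2.0, 0.0], [0.0, 1.0]]
y = [4.0, 0.3]
x_fast = fast_proximal_gradient(H, y, tau=0.5, lam=0.2, num_iters=200)
print(x_fast)
```

The step size `lam=0.2` does not exceed $1/\gamma = 0.25$ for this `H`, and the result matches the hand-computed minimizer $[1.875, 0]$ of the separable task.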


THE PROXIMAL FORWARD-BACKWARD SPLITTING OPERATOR 


At first sight, the iterative update given in (8.131) seems to be a bit "magic." However, this is not
the case and we can come to it by following simple arguments starting from the basic property of a
minimizer. Indeed, let $x_*$ be a minimizer of (8.130). Then we know that it has to satisfy

$$0 \in \partial f(x_*) + \nabla g(x_*), \quad \text{or equivalently}$$
$$0 \in \lambda\partial f(x_*) + \lambda\nabla g(x_*), \quad \text{or equivalently}$$
$$0 \in \lambda\partial f(x_*) + x_* - x_* + \lambda\nabla g(x_*),$$

or equivalently

$$(I - \lambda\nabla g)(x_*) \in (I + \lambda\partial f)(x_*),$$

or

$$(I + \lambda\partial f)^{-1}(I - \lambda\nabla g)(x_*) = x_*,$$

and finally

$$x_* = \mathrm{Prox}_{\lambda f}\big(x_* - \lambda\nabla g(x_*)\big). \tag{8.134}$$

In other words, a minimizer of the task is a fixed point of the operator

$$(I + \lambda\partial f)^{-1}(I - \lambda\nabla g) : \mathbb{R}^l \to \mathbb{R}^l. \tag{8.135}$$

The latter is known as the proximal forward-backward splitting operator and it can be shown that if
$\lambda \in \big(0, \frac{1}{\gamma}\big]$, where $\gamma$ is the Lipschitz constant, then this operator is nonexpansive [108]. This short story
justifies the reason that the iteration in (8.131) is attracted toward the set of minimizers.

Remarks 8.10. 

• The proximal gradient splitting algorithm can be considered as a generalization of some previously
considered algorithms. If we set $f(x) = \iota_C(x)$, the proximal operator becomes the projection op-
erator and the projected gradient algorithm of (8.62) results. If $f(x) = 0$, we obtain the gradient
algorithm, and if $g(x) = 0$ the proximal point algorithm comes up.

• Besides batch proximal splitting algorithms, online schemes have been proposed (see [36,48,106,
107]), with an emphasis on the $\ell_1$ regularization tasks.

• The application and development of novel versions of this family of algorithms in the fields of
machine learning and signal processing is still an ongoing field of research and the interested reader
can delve deeper into the field via [7,28,67,108].

ALTERNATING DIRECTION METHOD OF MULTIPLIERS (ADMM)

Extensions of the proximal splitting gradient algorithm for the case where both functions / and g 
are nonsmooth have also been developed, such as the Douglas-Rachford algorithm [27,54]. Here, we 
are going to focus on one of the most popular schemes, known as the alternating direction method of 
multipliers (ADMM) algorithm [40]. 

The ADMM algorithm is based on the notion of the augmented Lagrangian and at its very heart 
lies the Lagrangian duality concept (Appendix C). 

The goal is to minimize the sum $f(x) + g(x)$, where both $f$ and $g$ can be nonsmooth. This can equiv-
alently be written as

$$\text{minimize with respect to } x, y \quad f(x) + g(y), \tag{8.136}$$

$$\text{subject to} \quad x - y = 0. \tag{8.137}$$

The augmented Lagrangian is defined as

$$L_\lambda(x, y, z) := f(x) + g(y) + \frac{1}{\lambda}z^T(x - y) + \frac{1}{2\lambda}\|x - y\|^2, \tag{8.138}$$

where we have denoted the corresponding Lagrange multipliers (footnote 10) by $z$. The previous equation can be
rewritten as

$$L_\lambda(x, y, z) := f(x) + g(y) + \frac{1}{2\lambda}\|x - y + z\|^2 - \frac{1}{2\lambda}\|z\|^2. \tag{8.139}$$

The ADMM is given in Algorithm 8.6.


10 In the book, we have used $\lambda$ for the Lagrange multipliers. However, here, we have already reserved $\lambda$ for the proximal
operator.

Algorithm 8.6 (The ADMM algorithm).

• Initialization

- Fix $\lambda > 0$.

- Select $y_0$, $z_0$.

• For $k = 1, 2, \ldots$, Do

- $x_k = \mathrm{prox}_{\lambda f}(y_{k-1} - z_{k-1})$

- $y_k = \mathrm{prox}_{\lambda g}(x_k + z_{k-1})$

- $z_k = z_{k-1} + (x_k - y_k)$

• End For

Looking carefully at the algorithm and (8.139), observe that the first recursion corresponds to the
minimization of the augmented Lagrangian with respect to $x$, keeping $y$ and $z$ fixed from the previous
iteration. The second recursion corresponds to the minimization with respect to $y$, by keeping $x$ and
$z$ frozen to their currently available estimates. The last recursion is an update of the dual variables
(Lagrange multipliers) in the ascent direction; note that the difference in the parentheses is, up to the factor $1/\lambda$, the gradient
of the augmented Lagrangian in (8.138) with respect to $z$. Recall from Appendix C that the saddle point is found
as a max-min problem of the primal ($x$, $y$) and the dual variables. The convergence of the algorithm
has been analyzed in [40]. For related tutorial papers the reader can look in, e.g., [16,46].
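
The three recursions of the ADMM can be sketched for a toy task in which both functions are nonsmooth: $f(x) = \|x\|_1$ and $g$ the indicator function of a box, so that the two proximal operators are soft thresholding and clipping, respectively. The task, the box bounds, and the value of $\lambda$ are assumptions chosen only to keep the example verifiable by hand.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise soft thresholding)."""
    return [vi - t if vi > t else vi + t if vi < -t else 0.0 for vi in v]


def clip(v, lo, hi):
    """Projection onto the box [lo_i, hi_i], i.e., the prox of its indicator."""
    return [min(max(vi, l), h) for vi, l, h in zip(v, lo, hi)]


def admm_l1_box(lo, hi, lam, num_iters):
    """ADMM for: minimize ||x||_1 + indicator_box(y) subject to x - y = 0."""
    n = len(lo)
    x = [0.0] * n
    y = [0.0] * n              # y_0
    z = [0.0] * n              # z_0 (scaled dual variable)
    for _ in range(num_iters):
        x = soft_threshold([y[i] - z[i] for i in range(n)], lam)   # x-update
        y = clip([x[i] + z[i] for i in range(n)], lo, hi)          # y-update
        z = [z[i] + x[i] - y[i] for i in range(n)]                 # dual ascent
    return x, y, z


x_admm, y_admm, z_admm = admm_l1_box(lo=[1.0, -2.0], hi=[3.0, 2.0], lam=0.5, num_iters=50)
print(x_admm, y_admm, z_admm)
```

The point of smallest $\ell_1$ norm in the box $[1, 3]\times[-2, 2]$ is $[1, 0]$, and at convergence the two primal variables agree, as required by the constraint $x - y = 0$.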


MIRROR DESCENT ALGORITHMS 

A closely related algorithmic family to the forward-backward optimization algorithms is traced back
to the work in [64]; it is known as mirror descent algorithms (MDAs). The method has undergone a
number of evolutionary steps (for example, [9,66]). Our focus will be on adopting online schemes to
minimize the regularized expected loss function

$$J(\theta) = \mathbb{E}\big[\mathcal{L}(\theta, y, x)\big] + \phi(\theta),$$

where the regularizing function, $\phi$, is assumed to be convex, but not necessarily smooth. In a recent
representative of this algorithmic class, also known as the regularized dual averaging (RDA) algorithm
[100], the main iterative equation is expressed as

$$\theta_n = \arg\min_{\theta}\big\{\langle\bar{\mathcal{L}}', \theta\rangle + \phi(\theta) + \mu_n\psi(\theta)\big\}, \tag{8.140}$$

where $\psi$ is a strongly convex auxiliary function. For example, one possibility is to choose $\phi(\theta) =
\lambda\|\theta\|_1$ and $\psi(\theta) = \|\theta\|_2^2$ [100]. The average subgradient of $\mathcal{L}$ up to and including time instant $n - 1$ is
denoted by $\bar{\mathcal{L}}'$, that is,

$$\bar{\mathcal{L}}' = \frac{1}{n-1}\sum_{j=1}^{n-1}\mathcal{L}'_j(\theta_j),$$

where $\mathcal{L}_j(\theta) := \mathcal{L}(\theta, y_j, x_j)$.

It can be shown that if the subgradients are bounded, and $\mu_n = O\big(\frac{1}{\sqrt{n}}\big)$, then following regret
analysis arguments an $O\big(\frac{1}{\sqrt{n}}\big)$ convergence rate is achieved. If, on the other hand, the regularizing term
is strongly convex and $\mu_n = O\big(\frac{\ln n}{n}\big)$, then an $O\big(\frac{\ln n}{n}\big)$ rate is obtained. In [100], different variants are
proposed. One is based on Nesterov's arguments, as used in Algorithm 8.5, which achieves an $O\big(\frac{1}{n^2}\big)$
convergence rate.

A closer observation of (8.140) reveals that it can be considered as a generalization of the recursion
given in (8.131). Indeed, let us set in (8.140)

$$\psi(\theta) = \frac{1}{2}\|\theta - \theta_{n-1}\|^2,$$

and in place of the average gradient consider the most recent value, $\mathcal{L}'_{n-1}$. Then (8.140) becomes
equivalent to

$$0 \in \mathcal{L}'_{n-1} + \partial\phi(\theta) + \mu_n(\theta - \theta_{n-1}). \tag{8.141}$$

This is the same relation that would result from (8.131) if we set

$$f \to \phi, \quad x_k \to \theta_n, \quad x_{k-1} \to \theta_{n-1}, \quad g \to \mathcal{L}, \quad \lambda_k \to \frac{1}{\mu_n}. \tag{8.142}$$

As a matter of fact, using these substitutions and setting $\phi(\cdot) = \|\cdot\|_1$, the FOBOS algorithm results
[36]. However, in the case of (8.140) one has the luxury of using other functions in place of the squared
Euclidean distance from $\theta_{n-1}$.

A popular auxiliary function that has been exploited is the Bregman divergence. The Bregman
divergence with respect to a function, say, $\phi$, between two points $x$, $y$ is defined as

$$B_\phi(x, y) = \phi(x) - \phi(y) - \langle\nabla\phi(y), x - y\rangle : \quad \text{Bregman divergence}. \tag{8.143}$$

It is left as a simple exercise to verify that the squared Euclidean distance results as the Bregman divergence if
$\phi(x) = \|x\|^2$.
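
The exercise above can be checked numerically with a direct transcription of (8.143). The helper names and the two test points are arbitrary; the generating function and its gradient are passed in as plain Python callables.

```python
def bregman_divergence(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    return phi(x) - phi(y) - inner


def sq_norm(x):
    """phi(x) = ||x||^2."""
    return sum(xi * xi for xi in x)


def grad_sq_norm(x):
    """Gradient of ||x||^2, i.e., 2x."""
    return [2.0 * xi for xi in x]


x = [1.0, 2.0]
y = [3.0, 0.0]
d = bregman_divergence(sq_norm, grad_sq_norm, x, y)
sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
print(d, sq_dist)
```

Both quantities coincide, since $\|x\|^2 - \|y\|^2 - \langle 2y, x - y\rangle = \|x - y\|^2$. Other choices of the generating function yield non-Euclidean notions of proximity.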

Another algorithmic variant is the so-called composite mirror descent, which employs the cur-
rently available estimate of the subgradient, instead of the average, combined with the Bregman
divergence; that is, $\bar{\mathcal{L}}'$ is replaced by $\mathcal{L}'_{n-1}$ and $\psi(\theta)$ by $B_\phi(\theta, \theta_{n-1})$ for some function $\phi$ [37]. In
[38], a time-varying $\phi_n$ is involved by using a weighted average of the Euclidean norm, as pointed out
in Section 8.10.3. Note that although these modifications may look simple, the analysis of the
respective algorithms can be quite hard and substantial differences can be obtained in the performance.

At the time the current edition of the book is being compiled, this area is a hot topic of research,
and it is still early to draw definite conclusions concerning the relative performance benefits among
the various schemes. It may turn out, as is often the case, that different algorithms are better suited for
different applications and data sets.


8.15 DISTRIBUTED OPTIMIZATION: SOME HIGHLIGHTS

Distributed optimization (e.g., [86]) has already been discussed in the context of the stochastic gradient
descent with an emphasis on LMS and diffusion-type algorithms in Chapter 5 and also in Section 8.9.
At the heart of distributed or decentralized optimization lies a cost function,

$$J(\theta) = \sum_{k=1}^{K}J_k(\theta),$$

which is defined over a connected network of $K$ nodes (agents). Each $J_k$, $k = 1, 2, \ldots, K$, is fed with
the $k$th node's data and quantifies the contribution of the corresponding agent to the overall cost. The
goal is to compute an optimizer, $\theta_*$, that is common to all nodes. Moreover, each node communicates its
updates, in the context of an iterative algorithm, only to its neighbors, instead of a common fusion
center.

The $i$th step of the general iterative scheme of the so-called consensus algorithms (Section 5.13.4)
is of the form

$$\theta_k^{(i)} = \sum_{m\in\mathcal{N}_k}a_{mk}\theta_m^{(i-1)} - \mu_i\nabla J_k\big(\theta_k^{(i-1)}\big), \quad k = 1, 2, \ldots, K, \tag{8.144}$$

where $\mathcal{N}_k$ is the set of indices corresponding to the neighbors of the $k$th agent, that is, the nodes
with which the $k$th node shares links in the corresponding graph (see Section 5.13). Matrix $A =
[a_{mk}]$, $m, k = 1, 2, \ldots, K$, is a mixing matrix, which has to satisfy certain properties, e.g., to be left or
doubly stochastic (Section 5.13.2).
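
A consensus-type recursion of this kind can be simulated on a toy network. In the sketch below each node combines its neighbors' estimates through a mixing matrix and then takes a local gradient step in the descent direction. All specifics are assumptions for illustration: scalar parameters, local costs $J_k(\theta) = \frac{1}{2}(\theta - c_k)^2$, a fully connected three-node network with a symmetric doubly stochastic matrix, and the diminishing step size $\mu_i = 1/i$.

```python
def consensus_gradient(targets, A, num_iters):
    """Consensus iteration for J(theta) = sum_k 0.5 * (theta - c_k)^2 with scalar theta.

    targets[k] is c_k and A[m][k] is the mixing weight a_mk.
    """
    K = len(targets)
    theta = [0.0] * K
    for i in range(1, num_iters + 1):
        mu = 1.0 / i                                   # diminishing step size
        # combine the estimates received from the neighbors
        mixed = [sum(A[m][k] * theta[m] for m in range(K)) for k in range(K)]
        # local gradient of J_k, evaluated at the node's previous estimate
        theta = [mixed[k] - mu * (theta[k] - targets[k]) for k in range(K)]
    return theta


targets = [1.0, 2.0, 6.0]                              # minimizer of the sum is the mean, 3
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]                               # symmetric, doubly stochastic
theta_final = consensus_gradient(targets, A, num_iters=2000)
print(theta_final)
```

All three local estimates end up close to the common minimizer $\theta_* = 3$, although no node ever sees the other nodes' costs directly; the remaining disagreement shrinks as the step size decays.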

In the above framework, a number of algorithms have been suggested based on a diminishing step
size sequence that converge to a solution $\theta_*$, i.e.,

$$\theta_* = \arg\min_{\theta}J(\theta).$$

For example in [39], under the assumption of bounded gradients and a step size of $\mu_i = \frac{1}{\sqrt{i}}$, a conver-
gence rate of $O\big(\frac{\ln i}{\sqrt{i}}\big)$ is attained. Similar results hold for the so-called push method that is implemented
on a dynamic graph [62]. Improved rates are obtained under the assumption of strong convexity [63].

In [85], a modification known as the EXTRA version of the basic iterative scheme, given in
Eq. (8.144), is introduced, which adopts a constant step size, $\mu_i = \mu$, and it can achieve a conver-
gence rate of $O\big(\frac{1}{i}\big)$ and a linear convergence rate under the strong convexity assumption for the cost
function. In [110,111], a relation for diffusion-type algorithms is developed, which also relaxes some
of the assumptions, concerning matrix $A$, compared to the original EXTRA algorithm.

Distributed optimization is a topic with ongoing intense research at the time the second edition of
the book is compiled. Our goal in this section was not to present and summarize the related literature but
rather to alert the reader to some key contributions and directions in the field.


PROBLEMS 

8.1 Prove the Cauchy-Schwarz inequality in a general Hilbert space.

8.2 Show (a) that the set of points in a Hilbert space $\mathbb{H}$,

$C = \{x : \|x\| \le 1\}$,

is a convex set, and (b) that the set of points

$C = \{x : \|x\| = 1\}$

is a nonconvex one.

8.3 Show the first-order convexity condition.

8.4 Show that a function $f$ is convex if the one-dimensional function

$g(t) := f(x + ty)$

is convex, $\forall x, y$ in the domain of definition of $f$.

8.5 Show the second-order convexity condition.

Hint: Show the claim first for the one-dimensional case, and then use the result of the previous
problem for the generalization.

8.6 Show that a function

$f : \mathbb{R}^l \to \mathbb{R}$

is convex if and only if its epigraph is convex.

8.7 Show that if a function is convex, then its lower level set is convex for any $\xi$.

8.8 Show that in a Hilbert space $\mathbb{H}$, the parallelogram rule

$\|x + y\|^2 + \|x - y\|^2 = 2\big(\|x\|^2 + \|y\|^2\big), \quad \forall x, y \in \mathbb{H},$

holds true.

8.9 Show that if $x, y \in \mathbb{H}$, where $\mathbb{H}$ is a Hilbert space, then the inner product-induced norm satisfies
the triangle inequality, as required by any norm, that is,

$\|x + y\| \le \|x\| + \|y\|$.

8.10 Show that if a point $x_*$ is a local minimizer of a convex function, it is necessarily a global one.
Moreover, it is the unique minimizer if the function is strictly convex.

8.11 Let $C$ be a closed convex set in a Hilbert space $\mathbb{H}$. Then show that $\forall x \in \mathbb{H}$, there exists a point,
denoted as $P_C(x) \in C$, such that

$\|x - P_C(x)\| = \min_{y\in C}\|x - y\|$.

8.12 Show that the projection of a point $x \in \mathbb{H}$ onto a nonempty closed convex set, $C \subset \mathbb{H}$, lies on the
boundary of $C$.

8.13 Derive the formula for the projection onto a hyperplane in a (real) Hilbert space $\mathbb{H}$.

8.14 Derive the formula for the projection onto a closed ball $B[0, \delta]$.

8.15 Find an example of a point whose projection on the $\ell_1$ ball is not unique.

8.16 Show that if $C \subset \mathbb{H}$ is a closed convex set in a Hilbert space, then $\forall x \in \mathbb{H}$ and $\forall y \in C$, the
projection $P_C(x)$ satisfies the following properties:

• $\mathrm{Real}\{\langle x - P_C(x), y - P_C(x)\rangle\} \le 0$,

• $\|P_C(x) - P_C(y)\|^2 \le \mathrm{Real}\{\langle x - y, P_C(x) - P_C(y)\rangle\}$.

8.17 Prove that if $S$ is a closed subspace $S \subset \mathbb{H}$ in a Hilbert space $\mathbb{H}$, then $\forall x, y \in \mathbb{H}$,

$\langle x, P_S(y)\rangle = \langle P_S(x), y\rangle = \langle P_S(x), P_S(y)\rangle$

and

$P_S(ax + by) = aP_S(x) + bP_S(y)$.

Hint: Use the result of Problem 8.18.

8.18 Let $S$ be a closed convex subspace in a Hilbert space $\mathbb{H}$, $S \subset \mathbb{H}$. Let $S^\perp$ be the set of all elements
$x \in \mathbb{H}$ which are orthogonal to $S$. Then show that (a) $S^\perp$ is also a closed subspace, (b) $S \cap S^\perp =
\{0\}$, (c) $\mathbb{H} = S \oplus S^\perp$, that is, $\forall x \in \mathbb{H}$, $\exists x_1 \in S$ and $x_2 \in S^\perp$: $x = x_1 + x_2$, where $x_1$, $x_2$ are
unique.

8.19 Show that the relaxed projection operator is a nonexpansive mapping. 

8.20 Show that the relaxed projection operator is a strongly attractive mapping. 

8.21 Give an example of a sequence in a Hilbert space $\mathbb{H}$ which converges weakly but not strongly.

8.22 Prove that if $C_1, \ldots, C_K$ are closed convex sets in a Hilbert space $\mathbb{H}$, then the operator

$T = T_{C_K}\cdots T_{C_1}$

is a regular one; that is,

$\|T^{n-1}(x) - T^n(x)\| \to 0, \quad n \to \infty,$

where $T^n := TT\cdots T$ is the application of $T$ $n$ successive times.

8.23 Show the fundamental POCS theorem for the case of closed subspaces in a Hilbert space $\mathbb{H}$.

8.24 Derive the subdifferential of the metric distance function $d_C(x)$, where $C$ is a closed convex set
$C \subset \mathbb{R}^l$ and $x \in \mathbb{R}^l$.

8.25 Derive the bound in (8.55). 

8.26 Show that if a function is $\gamma$-Lipschitz, then any of its subgradients is bounded.

8.27 Show the convergence of the generic projected subgradient algorithm in (8.61). 

8.28 Derive Eq. (8.100). 

8.29 Consider the online version of PDMb in (8.64), that is,

$$\theta_n = \begin{cases} P_C\left(\theta_{n-1} - \mu_n\dfrac{J(\theta_{n-1})}{\|J'(\theta_{n-1})\|^2}J'(\theta_{n-1})\right), & \text{if } J'(\theta_{n-1}) \ne 0, \\[6pt] P_C(\theta_{n-1}), & \text{if } J'(\theta_{n-1}) = 0, \end{cases} \tag{8.145}$$

where we have assumed that $J_* = 0$. If this is not the case, a shift can accommodate for the
difference. Thus, we assume that we know the minimum. For example, this is the case for a
number of tasks, such as the hinge loss function, assuming linearly separable classes, or the
linear $\epsilon$-insensitive loss function, for bounded noise. Assume that

$$\mathcal{L}_n(\theta) = \sum_{k=n-q+1}^{n}\frac{\omega_k d_{C_k}(\theta_{n-1})}{\sum_{j=n-q+1}^{n}\omega_j d_{C_j}(\theta_{n-1})}\,d_{C_k}(\theta).$$

Then derive the APSM algorithm of (8.39).

8.30 Derive the regret bound for the subgradient algorithm in (8.83). 

8.31 Show that a function $f(x)$ is $\sigma$-strongly convex if and only if the function $f(x) - \frac{\sigma}{2}\|x\|^2$ is
convex.

8.32 Show that if the loss function is $\sigma$-strongly convex, then if $\mu_n = \frac{1}{\sigma n}$, the regret bound for the
subgradient algorithm becomes

$$\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_n(\theta_{n-1}) \le \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}_n(\theta_*) + \frac{G^2(1 + \ln N)}{2\sigma N}. \tag{8.146}$$

8.33 Consider a batch algorithm that computes the minimum of the empirical loss function, $\theta_*(N)$,
having a quadratic convergence rate, that is,

$$\ln\ln\frac{1}{\|\theta^{(i)} - \theta_*(N)\|^2} \sim i.$$

Show that an online algorithm, running for $n$ time instants so as to spend the same computational
processing resources as the batch one, achieves for large values of $N$ better performance than
the batch algorithm, shown as [12]

$$\|\theta_n - \theta_*\|^2 \sim \frac{1}{N\ln\ln N} \ll \frac{1}{N} \sim \|\theta_*(N) - \theta_*\|^2.$$

Hint: Use the fact that

$$\|\theta_n - \theta_*\|^2 \sim \frac{1}{n} \quad \text{and} \quad \|\theta_*(N) - \theta_*\|^2 \sim \frac{1}{N}.$$

8.34 Show property (8.111) for the proximal operator.

8.35 Show property (8.112) for the proximal operator. 

8.36 Prove that the recursion in (8.118) converges to a minimizer of /. 

8.37 Derive (8.122) from (8.121). 


MATLAB® EXERCISES 

8.38 Consider the regression model,

$y_n = \theta_o^T x_n + \eta_n$,

where $\theta_o \in \mathbb{R}^{200}$ ($l = 200$) and the coefficients of the unknown vector are obtained randomly
via the Gaussian distribution $\mathcal{N}(0, 1)$. The noise samples are also i.i.d., having zero mean and
variance $\sigma^2 = 0.01$. The input sequence is a white noise one, i.i.d. generated via the Gaussian
$\mathcal{N}(0, 1)$.

Using as training data the samples $(y_n, x_n) \in \mathbb{R}\times\mathbb{R}^{200}$, $n = 1, 2, \ldots$, run the APA (Algo-
rithm 5.2), the NLMS (Algorithm 5.3), the RLS (Algorithm 6.1), and the APSM (Algorithm 8.2)
algorithms to estimate the unknown $\theta_o$.

For the APA, choose $\mu = 0.2$, $\delta = 0.001$, and $q = 30$. For the APSM, $\mu = 0.5 \times M_n$, $\epsilon =
\sqrt{2}\sigma$, and $q = 30$. Furthermore, in the NLMS set $\mu = 1.2$ and $\delta = 0.001$. Finally, for the RLS
set the forgetting factor $\beta$ equal to 1. Run 100 independent experiments and plot the average error
per iteration in dBs, that is, $10\log_{10}(e_n^2)$, where $e_n^2 = (y_n - x_n^T\theta_{n-1})^2$. Compare the performance
of the algorithms.

Keep the same parameters, but alter the noise variance so that it becomes 0.3. Plot the average
error per iteration as in the previous experiment. What do you observe regarding the performance
of the APA compared to the previous low-noise scenario?

Keep playing with different parameters and study the effect on the convergence speed and the
error floor to which the algorithms converge.

8.39 Create an ad hoc network having 10 nodes and a total number of 32 connections. Generate at
each node the data which adhere to the following model:

$y_k(n) = \theta_o^T x_k(n) + \eta_k(n), \quad k = 1, \ldots, 10.$

The unknown vector $\theta_o \in \mathbb{R}^{60}$ and its coefficients are generated randomly via the Gaussian
$\mathcal{N}(0, 1)$. The input vectors are i.i.d. and follow an $\mathcal{N}(0, 1)$. Moreover, the noise samples are
i.i.d. generated from zero mean Gaussians with variances corresponding to different signal-to-
noise levels, varying randomly from 20 to 25 dBs from node to node.

For the unknown vector estimation employ the combine-then-adapt diffusion APSM (Al-
gorithm 8.3), adapt-then-combine LMS (Algorithm 5.7), combine-then-adapt LMS (Algo-
rithm 5.8), and noncooperative LMS (Algorithm 5.1) algorithms. For the combine-then-adapt
APSM, set $\mu_n = 0.5 \times M_n$, $\epsilon_k = \sqrt{2}\sigma_k$, and $q = 20$. For the adapt-then-combine, combine-then-
adapt, and noncooperative LMS, set the step size equal to 0.03. Finally, choose the combination
weights $a_{mk}$ with respect to the Metropolis rule (Remarks 5.4).

Run 100 independent experiments and plot the average MSD per iteration in dBs, that is,

$$\mathrm{MSD}(n) = 10\log_{10}\left(\frac{1}{K}\sum_{k=1}^{K}\|\theta_k(n) - \theta_o\|^2\right).$$

Compare the performance of the combine-then-adapt APSM with the performance of the LMS-
based algorithms.

Keep playing with different parameters for the involved algorithms and observe their influence
on the obtained performance.

8.40 Download the banknote authentication data set (footnote 11). Develop a MATLAB® program that imple-
ments the PEGASOS algorithm for classification (Algorithm 8.4). Keep 90% of the data as a
training set and the remaining 10% as a test set. Set $\lambda = 0.1$ and $m = 1, 10, 30$. Once the training
phase has been completed, freeze the parameters for the obtained classifier. Compute the classi-
fication error on the test set using the obtained classifier. Compare the classification error for the
three choices of $m$.


11 https://archive.ics.uci.edu/ml/datasets/banknote+authentication.


9.1 INTRODUCTION

In Chapter 3, the notion of regularization was introduced as a tool to address a number of problems
that are usually encountered in machine learning. Improving the performance of an estimator by shrinking
the norm of the minimum variance unbiased (MVU) estimator, guarding against overfitting, coping
with ill-conditioning, and providing a solution to an underdetermined set of equations are some notable
examples where regularization has provided successful answers. Some of the advantages were demonstrated
via the ridge regression concept, where the sum of squared errors cost function was combined,
in a tradeoff rationale, with the squared Euclidean norm of the desired solution.

In this and the next chapter, our interest will be on alternatives to the Euclidean norm, and in particular
the focus will revolve around the $\ell_1$ norm; this is the sum of the absolute values of the components
comprising a vector. Although seeking a solution to a problem via the $\ell_1$ norm regularization of a cost
function has been known and used since the 1970s, it is only recently that it has become the focus of
attention of a massive volume of research in the context of compressed sensing. At the heart of this
problem lies an underdetermined set of linear equations, which, in general, accepts an infinite number
of solutions. However, in a number of cases, an extra piece of information is available: the true model,
whose estimate we want to obtain, is sparse; that is, only a few of its coordinates are nonzero. It turns
out that a large number of commonly used applications can be cast under such a scenario and can
benefit from sparse modeling.

Besides its practical significance, sparsity-aware learning has offered to the scientific community
novel theoretical tools and solutions to problems that only a few years ago seemed intractable. This
is also a reason that this is an interdisciplinary field of research encompassing scientists from, for
example, mathematics, statistics, machine learning, and signal processing. Moreover, it has already
been applied in many areas, ranging from biomedicine to communications and astronomy. In this and
the following chapters, I made an effort to present in a unifying way the basic notions and ideas that
run across this field. The goal is to provide the reader with an overview of the major contributions
that have taken place in the theoretical and algorithmic fronts and have been consolidated as a distinct
scientific area.

In the current chapter, the focus is on presenting the main concepts and theoretical foundations 
related to sparsity-aware learning techniques. We start by reviewing various norms. Then we move 
on to establish conditions on the recovery of sparse vectors, or vectors that are sparse in a transform 
domain, using fewer observations than the dimension of the corresponding space. Geometry plays an
important part in our approach. Finally, some theoretical advances that tie sparsity and sampling theory 
are presented. At the end of the chapter, a case study concerning image denoising is discussed. 


9.2 SEARCHING FOR A NORM 


Mathematicians have been very imaginative in proposing various norms in order to equip linear spaces.
Among the most popular norms used in functional analysis are the $\ell_p$ norms. To tailor things to our
needs, given a vector $\theta \in \mathbb{R}^l$, its $\ell_p$ norm is defined as

$$\|\theta\|_p := \left(\sum_{i=1}^{l}|\theta_i|^p\right)^{1/p}. \tag{9.1}$$

For $p = 2$, the Euclidean or $\ell_2$ norm is obtained, and for $p = 1$, (9.1) results in the $\ell_1$ norm, that is,

$$\|\theta\|_1 = \sum_{i=1}^{l}|\theta_i|. \tag{9.2}$$
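If we let $p \to \infty$, then we get the $\ell_\infty$ norm; let $|\theta_{i_{\max}}| := \max\{|\theta_1|, |\theta_2|, \ldots, |\theta_l|\}$, and note that

$$\|\theta\|_\infty := \lim_{p\to\infty}\left(|\theta_{i_{\max}}|^p\sum_{i=1}^{l}\left(\frac{|\theta_i|}{|\theta_{i_{\max}}|}\right)^p\right)^{1/p} = |\theta_{i_{\max}}|, \tag{9.3}$$

that is, $\|\theta\|_\infty$ is equal to the maximum of the absolute values of the coordinates of $\theta$. One can show
that all the $\ell_p$ norms are true norms for $p \ge 1$; that is, they satisfy all four requirements that a function
$\mathbb{R}^l \to [0, \infty)$ must respect in order to be called a norm, that is,

1. $\|\theta\|_p \ge 0$,

2. $\|\theta\|_p = 0 \Leftrightarrow \theta = 0$,

3. $\|\alpha\theta\|_p = |\alpha|\,\|\theta\|_p$, $\forall \alpha \in \mathbb{R}$,

4. $\|\theta_1 + \theta_2\|_p \le \|\theta_1\|_p + \|\theta_2\|_p$.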

The third condition enforces the norm function to be (positively) homogeneous and the fourth one is
the triangle inequality. These properties also guarantee that any function that is a norm is also a convex
one (Problem 9.3). Strictly speaking, if we allow $p > 0$ to take values less than one in (9.1), the
resulting function is not a true norm (Problem 9.8); we may still call them norms, although we know
that this is an abuse of the definition of a norm. An interesting case, which will be used extensively in
this chapter, is the $\ell_0$ norm, which can be obtained as the limit, for $p \to 0$, of

$$\|\theta\|_0 := \lim_{p\to 0}\|\theta\|_p^p = \lim_{p\to 0}\sum_{i=1}^{l}|\theta_i|^p = \sum_{i=1}^{l}\chi_{(0,\infty)}(|\theta_i|), \tag{9.4}$$

where $\chi_{\mathcal{A}}(\cdot)$ is the characteristic function with respect to a set $\mathcal{A}$, defined as

$$\chi_{\mathcal{A}}(\tau) := \begin{cases} 1, & \text{if } \tau \in \mathcal{A}, \\ 0, & \text{if } \tau \notin \mathcal{A}. \end{cases}$$


That is, the $\ell_0$ norm is equal to the number of nonzero components of the respective vector. It is very
easy to check that this function is not a true norm. Indeed, this is not homogeneous, that is, $\|\alpha\theta\|_0 \ne
|\alpha|\,\|\theta\|_0$, $\forall \alpha \ne \pm 1$. Fig. 9.1 shows the isovalue curves, in the two-dimensional space, that correspond to
$\|\theta\|_p = 1$, for $p = 0, 0.5, 1, 2$, and $\infty$. Observe that for the Euclidean norm the isovalue curve has the
shape of a circle and for the $\ell_1$ norm the shape of a rhombus. We refer to them as the $\ell_2$ and the $\ell_1$
balls, respectively, by slightly “abusing” the meaning of a ball. Observe that in the case of the $\ell_0$ norm,
the isovalue curve comprises both the horizontal and the vertical axes, excluding the (0, 0) element. If
we restrict the size of the $\ell_0$ norm to be less than one, then the corresponding set of points becomes a
singleton, that is, (0, 0). Also, the set of all the two-dimensional points that have $\ell_0$ norm less than or
equal to two is the $\mathbb{R}^2$ space. This slightly “strange” behavior is a consequence of the discrete nature
of this “norm.”
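
The definitions in (9.1) to (9.4) can be checked numerically with a small sketch. The helper names `lp_norm`, `linf_norm`, and `l0_norm` and the test vectors are illustrative assumptions; the last two checks illustrate why the cases $p < 1$ and $p = 0$ are not true norms.

```python
def lp_norm(theta, p):
    """(sum_i |theta_i|^p)^(1/p) for p > 0; a true norm only when p >= 1."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

def linf_norm(theta):
    """Largest absolute coordinate, the limit of lp_norm as p grows."""
    return max(abs(t) for t in theta)

def l0_norm(theta):
    """Number of nonzero coordinates."""
    return sum(1 for t in theta if t != 0)

theta = [3.0, 0.0, -4.0]
l1, l2 = lp_norm(theta, 1), lp_norm(theta, 2)
linf, l0 = linf_norm(theta), l0_norm(theta)
print(l1, l2, linf, l0)

# Homogeneity: scaling by 2 doubles the l1 norm but leaves the l0 "norm" unchanged.
scaled = [2.0 * t for t in theta]
print(lp_norm(scaled, 1), l0_norm(scaled))

# Triangle inequality fails for p = 0.5: ||e1 + e2|| exceeds ||e1|| + ||e2||.
e1, e2 = [1.0, 0.0], [0.0, 1.0]
lhs = lp_norm([a + b for a, b in zip(e1, e2)], 0.5)
rhs = lp_norm(e1, 0.5) + lp_norm(e2, 0.5)
print(lhs, rhs)
```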


1 Strictly speaking, a ball must also contain all the points in the interior, that is, all concentric spheres of smaller radius (Chapter 8).
FIGURE 9.1

The isovalue curves for $\|\theta\|_p = 1$ and for various values of $p$, in the two-dimensional space. Observe that for the $\ell_0$
norm, the respective values cover the two axes with the exception of the point (0, 0). For the $\ell_1$ norm, the isovalue
curve is a rhombus, and for the $\ell_2$ (Euclidean) norm, it is a circle.





FIGURE 9.2 

Observe that the epigraph, that is, the region above the graph, is nonconvex for values $p < 1$, indicating the
nonconvexity of the respective $|\theta|^p$ function. The value $p = 1$ is the smallest one for which convexity is retained. Also
note that, for large values of $p > 1$, the contribution of small values of $|\theta| < 1$ to the respective norm becomes
insignificant.


Fig. 9.2 shows the graph of $|\theta|^p$, which is the individual contribution of each component of a vector
to the $\ell_p$ norm, for different values of $p$. Observe that (a) for $p < 1$, the region that is formed above the
graph (epigraph, see Chapter 8) is not a convex one, which verifies what we have already said, that is,
the respective function is not a true norm; and (b) for values of the argument $|\theta| > 1$, the larger the value
of $p \ge 1$ and the larger the value of $|\theta|$, the higher the contribution of the respective component to the
norm. Hence, if $\ell_p$ norms, $p \ge 1$, are used in the regularization method, components with large values
become the dominant ones and the optimization algorithm will concentrate on these by penalizing them
to get smaller so that the overall cost can be reduced. The opposite is true for values $|\theta| < 1$; $\ell_p$, $p > 1$,
norms tend to push the contribution of such components to zero. The $\ell_1$ norm is the only one (among
$p \ge 1$) that retains relatively large values even for small values of $|\theta| < 1$ and, hence, components with
small values can still have a say in the optimization process and can be penalized by being pushed to
smaller values. Hence, if the $\ell_1$ norm is used to replace the $\ell_2$ one in Eq. (3.41), only those components
of the vector that are really significant in reducing the model misfit measuring term in the regularized
cost function will be kept, and the rest will be forced to zero. The same tendency, yet more aggressive,
is true for $0 < p < 1$. The extreme case is when one considers the $\ell_0$ norm. Even a small increase of a
component from zero makes its contribution to the norm large, so the optimizing algorithm has to be
very “cautious” in making an element nonzero.

In a nutshell, from all the true norms ($p \ge 1$), the $\ell_1$ is the only one that shows respect to small
values. The rest of the $\ell_p$ norms, $p > 1$, just squeeze them to make their values even smaller, and care
mainly for the large values. We will return to this point very soon.
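
A few numbers make this comparison concrete. The sketch below tabulates the per-component contribution $|\theta|^p$ for one small and one large argument; the function name `contribution` and the chosen values 0.1 and 3 are illustrative assumptions.

```python
def contribution(theta, p):
    """Contribution |theta|^p of a single component to the p-th power of the lp norm."""
    return abs(theta) ** p

for p in (0.5, 1, 2, 4):
    small = contribution(0.1, p)
    large = contribution(3.0, p)
    print(f"p = {p}: |0.1|^p = {small:.4f}, |3|^p = {large:.4f}")
```

As $p$ grows beyond one, the small component's contribution collapses toward zero while the large component's contribution grows, which is the behavior described in the text.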


9.3 THE LEAST ABSOLUTE SHRINKAGE AND SELECTION 
OPERATOR (LASSO) 

In Chapter 3, we discussed some of the benefits in adopting the regularization method for enhancing
the performance of an estimator. In this chapter, we will see and study more reasons that justify the use
of regularization. The first one refers to what is known as the interpretation power of an estimator. For
example, in the regression task, we want to select those components $\theta_i$ of $\theta$ that have the most important
say in the formation of the output variable. This is very important if the number of parameters, $l$, is large
and we want to concentrate on the most important of them. In a classification task, not all features are
informative, hence one would like to keep the most informative of them and make the less informative
ones equal to zero. Another related problem refers to those cases where we know, a priori, that a
number of the components of a parameter vector are zero, but we do not know which ones. Now, the
discussion at the end of the previous section becomes more meaningful. Can we use, while regularizing,
an appropriate norm that can assist the optimization process (a) in unveiling such zeros or (b) in putting
more emphasis on the most significant of its components, those that play a decisive role in reducing the
misfit measuring term in the regularized cost function, and setting the rest of them equal to zero? Although
the $\ell_p$ norms, with $p < 1$, seem to be the natural choice for such a regularization, the fact that they are
not convex makes the optimization process hard. The $\ell_1$ norm is the one that is “closest” to them, yet
it retains the computationally attractive property of convexity.

The $\ell_1$ norm has been used for such problems for a long time. In the 1970s, it was used in seismology
[27,86], where the reflected signal that indicates changes in the various earth substrates is a
sparse one, that is, very few values are relatively large and the rest are small and insignificant. Since
then, it has been used to tackle similar problems in different applications (e.g., [40,80]). However, one
can trace two papers that were catalytic in providing the spark for the current strong interest around
the $\ell_1$ norm. One came from statistics [89], which addressed the LASSO task (first formulated, to our
knowledge, in [80]), to be discussed next, and the other came from the signal analysis community [26],
which formulated the basis pursuit, to be discussed in a later section.
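We first address our familiar regression task

$$y = X\theta + \eta, \quad y, \eta \in \mathbb{R}^N, \ \theta \in \mathbb{R}^l, \ N \ge l,$$

and obtain the estimate of the unknown parameter $\theta$ via the sum of squared error cost, regularized by
the $\ell_1$ norm, that is, for $\lambda \ge 0$,

$$\hat{\theta} := \arg\min_{\theta \in \mathbb{R}^l} L(\theta, \lambda) \tag{9.5}$$

$$:= \arg\min_{\theta \in \mathbb{R}^l} \left( \sum_{n=1}^{N} \left(y_n - x_n^T\theta\right)^2 + \lambda\|\theta\|_1 \right)$$

$$= \arg\min_{\theta \in \mathbb{R}^l} \left( (y - X\theta)^T(y - X\theta) + \lambda\|\theta\|_1 \right). \tag{9.6}$$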

Following the discussion with respect to the bias term given in Section 3.8 and in order to simplify the 
analysis, we will assume hereafter, without harming generality, that the data are of zero mean values. 
If this is not the case, the data can be centered by subtracting their respective sample means. 

It turns out that the task in (9.6) can be equivalently written in the following two formulations:

$$\begin{aligned} \hat{\theta}: \quad & \min_{\theta \in \mathbb{R}^l} \ (y - X\theta)^T(y - X\theta), \\ & \text{s.t.} \quad \|\theta\|_1 \le \rho, \end{aligned} \tag{9.7}$$

or

$$\begin{aligned} \hat{\theta}: \quad & \min_{\theta \in \mathbb{R}^l} \ \|\theta\|_1, \\ & \text{s.t.} \quad (y - X\theta)^T(y - X\theta) \le \epsilon, \end{aligned} \tag{9.8}$$

given the user-defined parameters $\rho, \epsilon \ge 0$. The formulation in (9.7) is known as the LASSO and the
one in (9.8) as the basis pursuit denoising (BPDN) (e.g., [15]). All three formulations are equivalent
for specific choices of $\lambda$, $\epsilon$, and $\rho$ (see, e.g., [14]). Observe that the minimized cost function in (9.6)
corresponds to the Lagrangian of the formulation in (9.7). However, this functional dependence among
$\lambda$, $\epsilon$, and $\rho$ is hard to compute, unless the columns of $X$ are mutually orthogonal. Moreover, this
equivalence does not necessarily imply that all three formulations are equally easy or difficult to solve.
As we will see later in this chapter, algorithms have been developed along each one of the previous
formulations. From now on, we will refer to all three formulations as the LASSO task, in a slight abuse
of the standard terminology, and the specific formulation will be apparent from the context, if not stated
explicitly.
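
To make the quantities in the three formulations tangible, the following sketch evaluates, for a candidate $\theta$, the squared error term (the quantity bounded by $\epsilon$ in (9.8)), the $\ell_1$ norm (the quantity bounded by $\rho$ in (9.7)), and the penalized cost of (9.6). The function names and the tiny data set are illustrative assumptions; nothing is being optimized here.

```python
def squared_error(X, y, theta):
    """(y - X theta)^T (y - X theta), with X given as a list of rows x_n^T."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        pred = sum(x * t for x, t in zip(x_n, theta))
        total += (y_n - pred) ** 2
    return total

def l1_norm(theta):
    return sum(abs(t) for t in theta)

def lasso_cost(X, y, theta, lam):
    """Penalized cost: squared error plus lam times the l1 norm."""
    return squared_error(X, y, theta) + lam * l1_norm(theta)

# Hypothetical data: N = 3 observations, l = 2 parameters.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
lam = 0.5

for name, theta in (("sparse", [1.0, 0.0]), ("dense", [0.9, 0.1])):
    print(name,
          "error:", round(squared_error(X, y, theta), 4),
          "l1:", round(l1_norm(theta), 4),
          "cost:", round(lasso_cost(X, y, theta, lam), 4))
```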

We know that ridge regression admits a closed form solution, that is,

$$\hat{\theta}_R = \left(X^TX + \lambda I\right)^{-1}X^Ty.$$

In contrast, this is not the case for LASSO, and its solution requires iterative techniques. It is
straightforward to see that LASSO can be formulated as a standard convex quadratic problem with linear
inequalities. Indeed, we can rewrite (9.6) as

$$\begin{aligned} \min_{\{\theta_i, u_i\}_{i=1}^{l}} \quad & (y - X\theta)^T(y - X\theta) + \lambda\sum_{i=1}^{l} u_i, \\ \text{s.t.} \quad & -u_i \le \theta_i \le u_i, \\ & u_i \ge 0, \quad i = 1, 2, \ldots, l, \end{aligned}$$

which can be solved by any standard convex optimization method (e.g., [14,101]). The reason that
developing algorithms for the LASSO task has been at the center of an intense research activity is
due to the emphasis on obtaining efficient algorithms by exploiting the specific nature of this task,
especially for cases where $l$ is very large, as is often the case in practice.

In order to get better insight into the nature of the solution that is obtained by LASSO, let us
assume that the regressors are mutually orthogonal and of unit norm, hence $X^TX = I$. Orthogonality
of the input matrix helps to decouple the coordinates and results in $l$ one-dimensional problems that
can be solved analytically. For this case, the LS estimate becomes

$$\hat{\theta}_{LS} = (X^TX)^{-1}X^Ty = X^Ty,$$

and the ridge regression gives

$$\hat{\theta}_R = \frac{1}{1+\lambda}\hat{\theta}_{LS}, \tag{9.9}$$

that is, every component of the LS estimate is simply shrunk by the same factor, $\frac{1}{1+\lambda}$; see, also,
Section 6.5.

In the case of $\ell_1$ regularization, the minimized Lagrangian function is no longer differentiable, due
to the presence of the absolute values in the $\ell_1$ norm. So, in this case, we have to consider the notion
of the subdifferential. It is known (Chapter 8) that if the zero vector belongs to the subdifferential set
of a convex function at a point, this means that this point corresponds to a minimum of the function.
Taking the subdifferential of the Lagrangian defined in (9.6) and recalling that the subdifferential set of
a differentiable function includes as its single element the respective gradient, the estimate $\hat{\theta}_1$, resulting
from the $\ell_1$ regularized task, must satisfy

$$0 \in -2X^Ty + 2X^TX\hat{\theta}_1 + \lambda\,\partial\|\hat{\theta}_1\|_1,$$

where $\partial$ stands for the subdifferential set (Chapter 8). If $X$ has orthonormal columns, the previous
equation can be written component-wise as follows:


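$$0 \in -\hat{\theta}_{LS,i} + \hat{\theta}_{1,i} + \frac{\lambda}{2}\,\partial\left|\hat{\theta}_{1,i}\right|, \quad \forall i, \tag{9.10}$$

where the subdifferential of the function $|\cdot|$, derived in Example 8.4 (Chapter 8), is given as

$$\partial|\theta| = \begin{cases} \{1\}, & \text{if } \theta > 0, \\ \{-1\}, & \text{if } \theta < 0, \\ [-1, 1], & \text{if } \theta = 0. \end{cases}$$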
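Thus, we can now write for each component of the LASSO optimal estimate

$$\hat{\theta}_{1,i} = \hat{\theta}_{LS,i} - \frac{\lambda}{2}, \quad \text{if } \hat{\theta}_{1,i} > 0, \tag{9.11}$$

$$\hat{\theta}_{1,i} = \hat{\theta}_{LS,i} + \frac{\lambda}{2}, \quad \text{if } \hat{\theta}_{1,i} < 0. \tag{9.12}$$

Note that (9.11) can only be true if $\hat{\theta}_{LS,i} > \frac{\lambda}{2}$, and (9.12) only if $\hat{\theta}_{LS,i} < -\frac{\lambda}{2}$. Moreover, in the case
where $\hat{\theta}_{1,i} = 0$, (9.10) and the subdifferential of $|\cdot|$ suggest that necessarily $|\hat{\theta}_{LS,i}| \le \frac{\lambda}{2}$. Concluding,
we can write in a more compact way that

$$\hat{\theta}_{1,i} = \text{sgn}(\hat{\theta}_{LS,i})\left(\left|\hat{\theta}_{LS,i}\right| - \frac{\lambda}{2}\right)_+ : \quad \text{soft thresholding operation}, \tag{9.13}$$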

where $(\cdot)_+$ denotes the “positive part” of the respective argument; it is equal to the argument if this
is nonnegative, and zero otherwise. This is very interesting indeed. In contrast to the ridge regression,
which shrinks all coordinates of the unregularized LS solution by the same factor, LASSO forces all
coordinates whose absolute value is less than or equal to $\lambda/2$ to zero, and the rest of the coordinates
are reduced, in absolute value, by the same amount $\lambda/2$. This is known as soft thresholding, to distinguish
it from the hard thresholding operation; the latter is defined as $\theta \cdot \chi_{(0,\infty)}\left(|\theta| - \frac{\lambda}{2}\right)$, $\theta \in \mathbb{R}$, where
$\chi_{(0,\infty)}(\cdot)$ stands for the characteristic function with respect to the set $(0, \infty)$. Fig. 9.3 shows the graphs
illustrating the effect that the ridge regression, LASSO, and hard thresholding have on the unregularized
LS solution, as a function of its value (horizontal axis). Note that our discussion here, simplified
via the orthonormal input matrix case, has quantified what we said before about the tendency of the
$\ell_1$ norm to push small values to become exactly zero. This will be further strengthened, via a more
rigorous mathematical formulation, in Section 9.5.
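
The closed form can be sanity-checked numerically. For orthonormal columns the cost in (9.6) decouples, up to a constant, into the scalar problems $(\hat{\theta}_{LS,i} - \theta)^2 + \lambda|\theta|$, so a brute-force grid search over $\theta$ should land on the soft thresholded value. The sketch below does exactly that; the function names, the grid limits, and the test values are illustrative assumptions.

```python
def soft_threshold(theta_ls, lam):
    """sgn(theta_ls) * (|theta_ls| - lam/2)_+ for a single coordinate."""
    mag = abs(theta_ls) - lam / 2.0
    if mag <= 0.0:
        return 0.0
    return mag if theta_ls > 0 else -mag

def scalar_cost(theta, theta_ls, lam):
    """Per-coordinate LASSO cost in the orthonormal case (constant dropped)."""
    return (theta_ls - theta) ** 2 + lam * abs(theta)

def grid_minimizer(theta_ls, lam, lo=-3.0, hi=3.0, steps=6001):
    """Brute-force minimizer of scalar_cost on a uniform grid with step 0.001."""
    best_theta, best_cost = lo, scalar_cost(lo, theta_ls, lam)
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        cost = scalar_cost(theta, theta_ls, lam)
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta

lam = 1.0
results = []
for theta_ls in (-1.3, -0.4, 0.0, 0.3, 2.0):
    closed = soft_threshold(theta_ls, lam)
    brute = grid_minimizer(theta_ls, lam)
    results.append((theta_ls, closed, brute))
    print(f"theta_LS = {theta_ls:+.1f}: closed form {closed:+.3f}, grid search {brute:+.3f}")
```

The two columns should agree to within the grid resolution, including the coordinates that are set exactly to zero.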

Example 9.1. Assume that the unregularized LS solution, for a given regression task, $y = X\theta + \eta$, is
given by

$$\hat{\theta}_{LS} = [0.2, -0.7, 0.8, -0.1, 1.0]^T.$$





FIGURE 9.3 

Curves relating the output (vertical) to the input (horizontal) for the hard thresholding and the soft thresholding
operators shown together with the linear operator that is associated with the ridge regression, for the same value of
$\lambda = 1$.
Derive the solutions for the corresponding ridge regression and $\ell_1$ norm regularization tasks. Assume
that the input matrix $X$ has orthonormal columns and that the regularization parameter is $\lambda = 1$. Also,
what is the result of hard thresholding the vector $\hat{\theta}_{LS}$ with threshold equal to 0.5?

We know that the corresponding solution for the ridge regression is

$$\hat{\theta}_R = \frac{1}{1+\lambda}\hat{\theta}_{LS} = [0.1, -0.35, 0.4, -0.05, 0.5]^T.$$

The solution for the $\ell_1$ norm regularization is given by soft thresholding, with threshold equal to $\lambda/2 =
0.5$, and hence the corresponding vector is

$$\hat{\theta}_1 = [0, -0.2, 0.3, 0, 0.5]^T.$$

The result of the hard thresholding operation is the vector $[0, -0.7, 0.8, 0, 1.0]^T$.
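
The three results of the example can be reproduced with a few lines of code. The function names below are illustrative assumptions, and the code relies on the orthonormal-columns setting of the example, in which all three rules act coordinate by coordinate.

```python
def ridge_orthonormal(theta_ls, lam):
    """Ridge solution for orthonormal columns: uniform shrinkage by 1/(1 + lam)."""
    return [t / (1.0 + lam) for t in theta_ls]

def soft_threshold(theta_ls, lam):
    """LASSO solution for orthonormal columns: shrink by lam/2, clip at zero."""
    out = []
    for t in theta_ls:
        mag = abs(t) - lam / 2.0
        out.append(0.0 if mag <= 0.0 else (mag if t > 0 else -mag))
    return out

def hard_threshold(theta_ls, threshold):
    """Keep coordinates whose magnitude exceeds the threshold, zero the rest."""
    return [t if abs(t) > threshold else 0.0 for t in theta_ls]

theta_ls = [0.2, -0.7, 0.8, -0.1, 1.0]
lam = 1.0
ridge = ridge_orthonormal(theta_ls, lam)
lasso = soft_threshold(theta_ls, lam)
hard = hard_threshold(theta_ls, 0.5)

for name, vec in (("ridge", ridge), ("lasso", lasso), ("hard", hard)):
    print(name, [round(v, 2) for v in vec])
```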

Remarks 9.1. 


• The hard and soft thresholding rules are only two possibilities out of a larger number of alternatives.
Note that the hard thresholding operation is defined via a discontinuous function, and this makes
this rule unstable in the sense of being very sensitive to small changes of the input. Moreover, this
shrinking rule tends to exhibit large variance in the resulting estimates. The soft thresholding rule is
a continuous function, but, as readily seen from the graph in Fig. 9.3, it introduces bias even for the
large values of the input argument. In order to ameliorate such shortcomings, a number of alternative
thresholding operators have been introduced and studied both theoretically and experimentally.
Although these are not within the mainstream of our interest, we provide two popular examples for
the sake of completeness: the smoothly clipped absolute deviation (SCAD) thresholding rule,

$$\hat{\theta}_{SCAD} = \begin{cases} \text{sgn}(\theta)\left(|\theta| - \lambda_{SCAD}\right)_+, & |\theta| \le 2\lambda_{SCAD}, \\ \dfrac{(\alpha - 1)\theta - \alpha\lambda_{SCAD}\,\text{sgn}(\theta)}{\alpha - 2}, & 2\lambda_{SCAD} < |\theta| \le \alpha\lambda_{SCAD}, \\ \theta, & |\theta| > \alpha\lambda_{SCAD}, \end{cases}$$

and the nonnegative garrote thresholding rule,

$$\hat{\theta}_{garr} = \begin{cases} 0, & |\theta| \le \lambda_{garr}, \\ \theta - \dfrac{\lambda_{garr}^2}{\theta}, & |\theta| > \lambda_{garr}. \end{cases}$$

Fig. 9.4 shows the respective graphs. Observe that in both cases, an effort has been made to remove
the discontinuity (associated with the hard thresholding) and to remove/reduce the bias for
large values of the input argument. The parameter $\alpha > 2$ is a user-defined one. For a more detailed
related discussion, the interested reader can refer, for example, to [2]. In [83], a generalized
thresholding rule is suggested that encompasses all previously mentioned ones as special cases.
Moreover, the proposed framework is general enough to provide means for designing novel thresholding
rules and/or incorporating a priori information associated with the sparsity level, i.e., the
number of nonzero components, of the sparse vector to be recovered.
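
The two alternative rules can be sketched in code as follows. The SCAD branches follow the three-case rule of this remark; for the garrote, the sketch assumes the commonly used form that returns $\theta - \lambda_{garr}^2/\theta$ above the threshold and zero otherwise. The function names and the default $\alpha = 3.7$ are illustrative assumptions.

```python
def sgn(x):
    """Sign of x as -1, 0, or +1."""
    return (x > 0) - (x < 0)

def scad(theta, lam, alpha=3.7):
    """SCAD thresholding: soft thresholding near zero, identity for large inputs."""
    mag = abs(theta)
    if mag <= 2.0 * lam:
        shrunk = mag - lam
        return sgn(theta) * shrunk if shrunk > 0.0 else 0.0
    if mag <= alpha * lam:
        return ((alpha - 1.0) * theta - alpha * lam * sgn(theta)) / (alpha - 2.0)
    return theta

def garrote(theta, lam):
    """Nonnegative garrote thresholding (assumed form: theta - lam^2 / theta)."""
    if abs(theta) <= lam:
        return 0.0
    return theta - lam ** 2 / theta

lam = 1.0
for theta in (0.5, 1.5, 2.0, 3.0, 5.0, 10.0):
    print(f"input {theta:5.1f}: SCAD {scad(theta, lam):7.4f}, garrote {garrote(theta, lam):7.4f}")
```

For large inputs the SCAD output equals the input, while the garrote output stays below it by $\lambda_{garr}^2/\theta$, a gap that shrinks as the input grows.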
FIGURE 9.4

Output (vertical)-input (horizontal) graph for the SCAD and nonnegative garrote rules with parameters $\alpha = 3.7$ and
$\lambda_{SCAD} = \lambda_{garr} = 1$. Observe that both rules smooth out the discontinuity associated with the hard thresholding
rule. Note, also, that the SCAD rule removes the bias associated with the soft thresholding rule for large values
of the input variable. On the contrary, the garrote thresholding rule allows some bias for large input values, which
diminishes as $\lambda_{garr}$ gets smaller and smaller.


9.4 SPARSE SIGNAL REPRESENTATION 

In the previous section, we brought into our discussion the need to take special care for zeros. Sparsity 
is an attribute that is met in a plethora of natural signals, because nature tends to be parsimonious. The 
notion of and need for parsimonious models was also discussed in Chapter 3, in the context of inverse 
problems in machine learning tasks. In this section, we will briefly present a number of application 
cases where the existence of zeros in a mathematical expansion is of paramount importance; hence, it 
justifies our search for and development of related analysis tools. 

In Chapter 4, we discussed the task of echo cancelation. In a number of cases, the echo path, 
represented by a vector comprising the values of the impulse response samples, is a sparse one. This is 
the case, for example, in internet telephony and in acoustic and network environments (e.g., [3,10,73]). 
Fig. 9.5 shows the impulse response of such an echo path. The impulse response of the echo path is 
of short duration; however, the delay with which it appears is not known. So, in order to model it, 
one has to use a long impulse response, yet only a relatively small number of the coefficients will be 
significant and the rest will be close to zero. Of course, one could ask, why not use an LMS or an RLS, 
and eventually the significant coefficients will be identified? The answer is that this turns out not to be 
the most efficient way to tackle such problems, because the convergence of the algorithm can be very 
slow. In contrast, if one embeds, somehow, into the problem the a priori information concerning the 
existence of (almost) zero coefficients, then the convergence speed can be significantly increased and 
also better error floors can be attained. 

A similar situation occurs in wireless communication systems, which involve multipath channels.
A typical application is in high-definition television (HDTV) systems, where the involved communication
channels consist of a few nonnegligible coefficients, some of which may have quite large time
delays with respect to the main signal (see, e.g., [4,32,52,77]). If the information signal is transmitted
at high symbol rates through such a dispersive channel, then the introduced intersymbol interference
(ISI) has a span of several tens up to hundreds of symbol intervals. This in turn implies that quite long
channel estimators are required at the receiver’s end in order to reduce effectively the ISI component
of the received signal, although only a small part of it has values substantially different from zero.
The situation is even more demanding when the channel-frequency response exhibits deep nulls. More
recently, sparsity has been exploited in channel estimation for multicarrier systems, both for single
antenna as well as for multiple-input-multiple-output (MIMO) systems [46,47]. A thorough, in-depth
treatment related to sparsity in multipath communication systems is provided in [5].

FIGURE 9.5

The impulse response function of an echo path in a telephone network. Observe that although it is of relatively short
duration, it is not a priori known where exactly in time it will occur.

Another example, which might be more widely known, is that of signal compression. It turns out
that if the signal modalities with which we communicate (e.g., speech) and also sense the world (e.g.,
images, audio) are transformed into a suitably chosen domain, then they are sparsely represented; only a
relatively small number of the signal components in this domain are large, and the rest are close to zero.
As an example, Fig. 9.6A shows an image and Fig. 9.6B shows the plot of the magnitude of the obtained
discrete cosine transform (DCT) components, which are computed by writing the corresponding image
array as a vector in lexicographic order. Note that more than 95% of the total energy is contributed by
only 5% of the largest components. This is at the heart of any compression technique. Only the large
coefficients are chosen to be coded and the rest are considered to be zero. Hence, significant gains are
obtained in memory/bandwidth requirements while storing/transmitting such signals, without much
perceptual loss. Depending on the modality, different transforms are used. For example, in JPEG-2000,
an image array, represented in terms of a vector that contains the intensity of the gray levels of the image
pixels, is transformed via the discrete wavelet transform (DWT), resulting in a transformed vector that
comprises only a few large components.

FIGURE 9.6

(A) A 512 x 512 pixel image and (B) the magnitude of its DCT components in descending order and logarithmic
scale. Note that more than 95% of the total energy is contributed by only 5% of the largest components.

Let

$$\tilde{s} = \Phi^H s, \quad s, \tilde{s} \in \mathbb{C}^l, \tag{9.14}$$

where $s$ is the vector of the “raw” signal samples, $\tilde{s}$ is the (complex-valued) vector of the transformed
ones, and $\Phi$ is the $l \times l$ transformation matrix. Often, this is an orthonormal/unitary matrix, $\Phi^H\Phi = I$.
Basically, a transform is nothing more than a projection of a vector on a new set of coordinate axes,
which comprise the columns of the transformation matrix $\Phi$. Celebrated examples of such transforms
are the wavelet, the discrete Fourier (DFT), and the discrete cosine (DCT) transforms (e.g., [87]). In
such cases, where the transformation matrix is orthonormal, one can write

$$s = \Psi\tilde{s}, \tag{9.15}$$

where $\Psi = \Phi$. Eq. (9.14) is known as the analysis and (9.15) as the synthesis equation.

Compression via such transforms exploits the fact that many signals in nature, which are rich in
context, can be compactly represented in an appropriately chosen basis, depending on the modality
of the signal. Very often, the construction of such bases tries to “imitate” the sensory systems that
the human brain has developed in order to sense these signals; and we know that nature (in contrast to
modern humans) does not like to waste resources. A standard compression task comprises the following
stages: (a) obtain the $l$ components of $\tilde{s}$ via the analysis step (9.14); (b) keep the $k$ most significant of
them; (c) code these values, as well as their respective locations in the transformed vector $\tilde{s}$; and
(d) obtain the (approximate) original signal $s$ when needed (after storage or transmission), via the
synthesis equation (Eq. (9.15)), where in place of $\tilde{s}$ only its $k$ most significant components are used,
which are the ones that were coded, while the rest are set equal to zero. However, there is something
unorthodox in this process of compression as it has been practiced until very recently. One processes
(transforms) large signal vectors of $l$ coordinates, where $l$ in practice can be quite large, and then uses
only a small percentage of the transformed coefficients, while the rest are simply ignored. Moreover,
one has to store/transmit the location of the respective large coefficients that are finally coded.
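
Stages (a), (b), and (d) of this pipeline can be sketched with an orthonormal DCT basis built by hand. The coding stage (c) is omitted, and the function names, the length $l = 32$, the choice $k = 4$, and the synthetic test signal are all illustrative assumptions.

```python
import math

def dct_basis(l):
    """Return the l orthonormal DCT-II basis vectors (the columns of Phi)."""
    basis = []
    for k in range(l):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / l)
        basis.append([scale * math.cos(math.pi * (2 * n + 1) * k / (2 * l))
                      for n in range(l)])
    return basis

def analysis(basis, s):
    """Transformed vector: inner product of s with each basis vector."""
    return [sum(b * x for b, x in zip(vec, s)) for vec in basis]

def synthesis(basis, coeffs):
    """Signal rebuilt as the coefficient-weighted sum of the basis vectors."""
    l = len(basis[0])
    return [sum(c * vec[n] for c, vec in zip(coeffs, basis)) for n in range(l)]

def keep_k_largest(coeffs, k):
    """Zero all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

l, k = 32, 4
basis = dct_basis(l)
# Hypothetical signal: two strong DCT components plus a small perturbation.
s = [3.0 * basis[1][n] + 1.5 * basis[5][n] + 0.01 * math.sin(n) for n in range(l)]

coeffs = analysis(basis, s)                            # stage (a)
approx = synthesis(basis, keep_k_largest(coeffs, k))   # stages (b) and (d)

energy = sum(x * x for x in s)
rel_err = sum((x - y) ** 2 for x, y in zip(s, approx)) / energy
print(f"kept {k} of {l} coefficients, relative squared error = {rel_err:.2e}")
```

Note that the full analysis step is still computed over all $l$ samples before most coefficients are discarded, which is exactly the inefficiency pointed out above.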

A natural question that is raised is the following: Because $\tilde{s}$ in the synthesis equation is (approximately)
sparse, can one compute it via an alternative path to the analysis equation in (9.14)? The issue
here is to investigate whether one could use a more informative way of sampling the available raw data
so that fewer than $l$ samples/observations are sufficient to recover all the necessary information. The
ideal case would be to recover it via a set of $k$ such samples, because this is the number of the significant
free parameters. On the other hand, if this sounds a bit extreme, can one obtain $N$ ($k < N \ll l$)
such signal-related measurements, from which $s$ can eventually be retrieved? It turns out that such an
approach is possible and it leads to the solution of an underdetermined system of linear equations,
under the constraint that the unknown target vector is a sparse one.

The importance of such techniques becomes even more apparent when, instead of an orthonormal
basis, as discussed before, a more general type of expansion is adopted, in terms of what is known
as overcomplete dictionaries. A dictionary [65] is a collection of parameterized waveforms, which are
discrete-time signal samples, represented as vectors $\psi_i \in \mathbb{C}^l$, $i \in \mathcal{I}$, where $\mathcal{I}$ is an integer index set. For
example, the columns of a DFT or a discrete wavelet (DWT) matrix comprise a dictionary. These are
two examples of what are known as complete dictionaries, which consist of $l$ (orthonormal) vectors,
that is, a number equal to the length of the signal vector. However, in many cases in practice, using such
dictionaries is very restrictive. Let us take, for example, a segment of audio signal, from a news media
or a video, that needs to be processed. This consists, in general, of different types of signals, namely,
speech, music, and environmental sounds. For each type of these signals, different signal vectors may
be more appropriate in the expansion for the analysis. For example, music signals are characterized
by a strong harmonic content and the use of sinusoids seems to be best for compression, while for
speech signals a Gabor-type signal expansion (sinusoids of various frequencies weighted by sufficiently
narrow pulses at different locations in time [31,87]) may be a better choice. The same applies when one
deals with an image. Different parts of an image, such as parts that are smooth or contain sharp edges,
may demand a different expansion vector set for obtaining the best overall performance. The more
recent tendency, in order to satisfy such needs, is to use overcomplete dictionaries. Such dictionaries
can be obtained, for example, by concatenating different dictionaries together, for example, a DFT
and a DWT matrix to result in a combined $l \times 2l$ transformation matrix. Alternatively, a dictionary
can be “trained” in order to effectively represent a set of available signal exemplars, a task that is
often referred to as dictionary learning [75,78,90,100]. While using such overcomplete dictionaries,
the synthesis equation takes the form

$$s = \sum_{i \in \mathcal{I}} \theta_i\psi_i. \tag{9.16}$$


Note that, now, the analysis is an ill-posed problem, because the elements $\{\psi_i\}_{i \in \mathcal{I}}$ (usually called
atoms) of the dictionary are not linearly independent, and there is not a unique set of parameters
$\{\theta_i\}_{i \in \mathcal{I}}$ that generates $s$. Moreover, we expect most of these parameters to be (nearly) zero. Note that
in such cases, the cardinality of $\mathcal{I}$ is larger than $l$. This necessarily leads to underdetermined systems
of equations with infinitely many solutions. The question that is now raised is whether we can exploit
the fact that most of these parameters are known to be zero, in order to come up with a unique solution.
If yes, under which conditions is such a solution possible? We will return to the task of learning
dictionaries in Chapter 19.

Besides the previous examples, there are a number of cases where an underdetermined system of
equations is the result of our inability to obtain a sufficiently large number of measurements, due to
physical and technical constraints. This is the case in MRI imaging, which will be presented in more
detail in Section 10.3.

9.5 IN SEARCH OF THE SPARSEST SOLUTION

Inspired by the discussion in the previous section, we now turn our attention to the task of solving
underdetermined systems of equations by imposing the sparsity constraint on the solution. We will
develop the theoretical setup in the context of regression and we will adhere to the notation that
has been adopted for this task. Moreover, we will focus on the real-valued data case in order to
simplify the presentation. The theory can be readily extended to the more general complex-valued
data case (see, e.g., [64,99]). We assume that we are given a set of observations/measurements,
$y := [y_1, y_2, \ldots, y_N]^T \in \mathbb{R}^N$, according to the linear model

$$ y = X\theta, \quad y \in \mathbb{R}^N,\ \theta \in \mathbb{R}^l,\ l > N, \tag{9.17} $$

where $X$ is the $N \times l$ input matrix, which is assumed to be of full row rank, that is, rank$(X) = N$. Our
starting point is the noiseless case. The linear system of equations in (9.17) is an underdetermined one
and accepts an infinite number of solutions. The set of possible solutions lies in the intersection of the
$N$ hyperplanes in the $l$-dimensional space,

$$ \left\{ \theta \in \mathbb{R}^l : y_n = x_n^T \theta \right\}, \quad n = 1, 2, \ldots, N. $$

We know from geometry that the intersection of $N$ nonparallel hyperplanes (which in our case is
guaranteed by the fact that $X$ has been assumed to be full row rank, hence $x_n$, $n = 1, 2, \ldots, N$, are
linearly independent) is a plane of dimensionality $l - N$ (e.g., the intersection of two [nonparallel]
[hyper]planes in the three-dimensional space is a straight line, that is, a plane of dimensionality equal
to one). In a more formal way, the set of all possible solutions, to be denoted as $\Theta$, is an affine set. An
affine set is the translation of a linear subspace by a constant vector. Let us pursue this a bit further,
because we will need it later on.

Let the null space of $X$ be denoted as null$(X)$ (sometimes denoted as $\mathcal{N}(X)$); it is defined as
the linear subspace

$$ \operatorname{null}(X) = \left\{ z \in \mathbb{R}^l : Xz = 0 \right\}. $$

Obviously, if $\theta_0$ is a solution to (9.17), that is, $\theta_0 \in \Theta$, then it is easy to verify that $\forall \theta \in \Theta$,
$X(\theta - \theta_0) = 0$, or $\theta - \theta_0 \in \operatorname{null}(X)$. As a result,

$$ \Theta = \theta_0 + \operatorname{null}(X), $$

and $\Theta$ is an affine set. We also know from linear algebra basics (and it is easy to show it; see
Problem 9.9) that the null space of a full row rank $N \times l$ matrix, $l > N$, is a subspace of dimensionality
$l - N$. Fig. 9.7 illustrates the case for one measurement sample in the two-dimensional space, $l = 2$
and $N = 1$. The set of solutions $\Theta$ is a straight line, which is the translation of the linear subspace
crossing the origin (the null$(X)$). Therefore, if one wants to select a single point among all the points
that lie in the affine set of solutions, $\Theta$, then an extra constraint/a priori knowledge has to be imposed.
In the sequel, three such possibilities are examined.
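The affine structure $\Theta = \theta_0 + \operatorname{null}(X)$ can be checked numerically with a minimal sketch. The sensing vector $x = [2/3, 1]^T$ and the measurement $y = 1$ are assumptions made here for illustration (they place the line of solutions through the points (1.5, 0) and (0, 1)); exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction as F

# One measurement in two dimensions (l = 2, N = 1): x^T theta = y.
# The numbers below are illustrative assumptions, not data from the text.
x = [F(2, 3), F(1)]
y = F(1)

def inner(a, b):
    """Inner product of two vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

theta0 = [F(0), F(1)]   # one particular solution of x^T theta = y
z = [F(3), F(-2)]       # spans null(X), because x^T z = 0

def solution(t):
    """The point theta0 + t*z of the affine set Theta."""
    return [p + t * d for p, d in zip(theta0, z)]

# Every translate of theta0 along the null space still satisfies the equation.
for t in (F(-1), F(0), F(1, 2), F(5)):
    assert inner(x, solution(t)) == y
```

Since $l - N = 1$, a single direction $z$ spans the null space, and the parameter $t$ sweeps the whole line of solutions.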


2 In $\mathbb{R}^l$, a hyperplane is of dimension $l - 1$. A plane has dimension lower than $l - 1$.

THE $\ell_2$ NORM MINIMIZER

Our goal now becomes to pick a point in (the affine set) $\Theta$ that corresponds to the minimum $\ell_2$ norm.
This is equivalent to solving the following constrained task:

$$ \min_{\theta \in \mathbb{R}^l} \|\theta\|_2^2, \quad \text{s.t.} \quad x_n^T \theta = y_n, \quad n = 1, 2, \ldots, N. \tag{9.18} $$

We already know from Section 6.4 (and one can rederive it by employing Lagrange multipliers; see
Problem 9.10) that the previous optimization task accepts a unique solution given in closed form as

$$ \hat{\theta} = X^T \left( X X^T \right)^{-1} y. \tag{9.19} $$

The geometric interpretation of this solution is provided in Fig. 9.7A, for the case of $l = 2$ and $N = 1$.
The radius of the Euclidean norm ball keeps increasing, until it touches the plane that contains the
solutions. This point is the one with the minimum $\ell_2$ norm or, equivalently, the point that lies closest
to the origin. Equivalently, the point $\hat{\theta}$ can be seen as the (metric) projection of $0$ onto $\Theta$.
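The closed form (9.19) can be evaluated directly for small problems. The following is a minimal sketch in exact rational arithmetic; the helper names `solve` and `min_l2_solution` and the numerical values are assumptions introduced for illustration, and $X$ is assumed to have full row rank so that $XX^T$ is invertible.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve the square system A w = b by Gauss-Jordan elimination (exact).
    Assumes A is nonsingular."""
    n = len(A)
    M = [[F(a) for a in row] + [F(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def min_l2_solution(X, y):
    """theta = X^T (X X^T)^{-1} y for a full row rank N x l matrix X."""
    N, l = len(X), len(X[0])
    G = [[sum(X[i][k] * X[j][k] for k in range(l)) for j in range(N)]
         for i in range(N)]                        # G = X X^T
    w = solve(G, y)                                # w = (X X^T)^{-1} y
    return [sum(X[n][k] * w[n] for n in range(N)) for k in range(l)]

# Single measurement with the assumed sensing vector x = [2/3, 1] and y = 1.
theta_hat = min_l2_solution([[F(2, 3), 1]], [1])
```

For these assumed values the sketch gives $\hat{\theta} = [6/13, 9/13]^T$, a point of the solution line with no zero component.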

Minimizing the $\ell_2$ norm in order to solve a linear set of underdetermined equations has been used
in various applications. The closest to us is in the context of determining the unknown parameters in
an expansion using an overcomplete dictionary of functions (vectors) [35]. A main drawback of this
method is that it is not sparsity preserving. There is no guarantee that the solution in (9.19) will give
zeros even if the true model vector $\theta$ has zeros. Moreover, the method is resolution-limited [26]. This
means that even if there may be a sharp contribution of specific atoms in the dictionary, this is not
portrayed in the obtained solution. This is a consequence of the fact that the information provided by $XX^T$
is a global one, containing all atoms of the dictionary in an "averaging" fashion, and the final result
tends to smooth out the individual contributions, especially when the dictionary is overcomplete.

FIGURE 9.7

The set of solutions $\Theta$ is an affine set (gray line), which is a translation of the null$(X)$ subspace (red line). (A) The
$\ell_2$ norm minimizer: The dotted circle corresponds to the smallest $\ell_2$ ball that intersects the set $\Theta$. As such, the
intersection point, $\hat{\theta}$, is the $\ell_2$ norm minimizer of the task in (9.18). Note that the vector $\hat{\theta}$ contains no zero component.
(B) The $\ell_1$ norm minimizer: The dotted rhombus corresponds to the smallest $\ell_1$ ball that intersects $\Theta$. Hence, the
intersection point, $\hat{\theta}$, is the solution of the constrained $\ell_1$ minimization task of (9.21). Note that the obtained estimate
$\hat{\theta} = (0, 1)$ contains a zero.

THE $\ell_0$ NORM MINIMIZER

Now we turn our attention to the $\ell_0$ norm (once more, it is pointed out that this is an abuse of the
definition of the norm), and we make sparsity our new flag under which a solution will be obtained.
The task now becomes

$$ \min_{\theta \in \mathbb{R}^l} \|\theta\|_0, \quad \text{s.t.} \quad x_n^T \theta = y_n, \quad n = 1, 2, \ldots, N, \tag{9.20} $$

that is, from all the points that lie on the plane of all possible solutions, find the sparsest one, that is,
the one with the lowest number of nonzero elements. As a matter of fact, such an approach is within
the spirit of Occam's razor rule: it corresponds to the smallest number of parameters that can explain
the obtained observations. The points that are now raised are:

• Is a solution to this problem unique, and under which conditions?

• Can a solution be obtained with low enough complexity in realistic time?

We postpone the answer to the first question until later. As for the second one, the news is not good.
Minimizing the $\ell_0$ norm under a set of linear constraints is a task of combinatorial nature, and as a
matter of fact, the problem is, in general, NP-hard [72]. The way to approach the problem is to consider
all possible combinations of zeros in $\theta$, removing the respective columns of $X$ in (9.17), and check
whether the system of equations is satisfied; keep as solutions the ones with the smallest number of
nonzero elements. The complexity of such a search depends exponentially on $l$.
Fig. 9.7A illustrates the two points ((1.5, 0) and (0, 1)) that comprise the solution set of minimizing
the $\ell_0$ norm for the single measurement (constraint) case.
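The exhaustive search described above can be sketched as follows. This is only an illustration for tiny problems, since the number of candidate supports grows as $2^l$; the function names and the test values ($x = [2/3, 1]^T$, $y = 1$) are assumptions. Only supports whose columns are linearly independent are accepted, which loses nothing: if the columns on the support of a solution were dependent, a sparser solution would exist.

```python
from fractions import Fraction as F
from itertools import combinations

def solve_on_support(X, y, S):
    """Return theta_S with X[:, S] theta_S = y, or None if the columns in S
    are dependent or the system is inconsistent on this support."""
    M = [[F(row[j]) for j in S] + [F(yn)] for row, yn in zip(X, y)]
    k, rank = len(S), 0
    for c in range(k):
        p = next((r for r in range(rank, len(M)) if M[r][c] != 0), None)
        if p is None:
            return None                       # dependent columns: skip
        M[rank], M[p] = M[p], M[rank]
        M[rank] = [v / M[rank][c] for v in M[rank]]
        for r in range(len(M)):
            if r != rank:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[rank])]
        rank += 1
    if any(row[k] != 0 for row in M[rank:]):
        return None                           # no solution on this support
    return [M[r][k] for r in range(k)]

def sparsest_solutions(X, y):
    """All solutions of X theta = y with the smallest number of nonzeros."""
    l = len(X[0])
    if all(yn == 0 for yn in y):
        return [[F(0)] * l]
    for k in range(1, l + 1):                 # increasing support size
        found = []
        for S in combinations(range(l), k):
            sol = solve_on_support(X, y, S)
            if sol is not None and all(v != 0 for v in sol):
                theta = [F(0)] * l
                for j, v in zip(S, sol):
                    theta[j] = v
                found.append(theta)
        if found:
            return found
    return []

solutions = sparsest_solutions([[F(2, 3), 1]], [1])
```

For the assumed single-measurement setup, the search returns the two 1-sparse points (1.5, 0) and (0, 1).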

THE $\ell_1$ NORM MINIMIZER

The current task is now given by

$$ \min_{\theta \in \mathbb{R}^l} \|\theta\|_1, \quad \text{s.t.} \quad x_n^T \theta = y_n, \quad n = 1, 2, \ldots, N. \tag{9.21} $$

Fig. 9.7B illustrates the geometry. The $\ell_1$ ball is increased until it touches the affine set of the possible
solutions. For this specific geometry, the solution is the point (0, 1), which is a sparse solution. In our
discussion in Section 9.2, we saw that the $\ell_1$ norm is the one, out of all $\ell_p$, $p \ge 1$ norms, that bears
some similarity with the sparsity promoting (nonconvex) $\ell_p$, $p < 1$ "norms." Also, we have commented
that the $\ell_1$ norm encourages zeros when the respective values are small. In the sequel, we will state one
lemma that establishes this zero favoring property in a more formal way. The $\ell_1$ norm minimization
task is also known as basis pursuit and it was suggested for decomposing a vector signal in terms of
the atoms of an overcomplete dictionary [26].

The $\ell_1$ minimizer can be brought into the standard linear programming (LP) form and then can be
solved by resorting to any related method; the simplex method and the more recent interior point method
are two possibilities (see, e.g., [14,33]). Indeed, consider the LP task

$$ \min_{x} c^T x, \quad \text{s.t.} \quad Ax = b, \quad x \ge 0. $$

To verify that our $\ell_1$ minimizer can be cast in the previous form, note first that any $l$-dimensional vector
$\theta$ can be decomposed as

$$ \theta = u - v, \quad u \ge 0, \quad v \ge 0. $$

Indeed, this holds true if, for example,

$$ u := \theta_+, \quad v := (-\theta)_+, $$

where $x_+$ stands for the vector obtained after keeping the positive components of $x$ and setting the rest
equal to zero. Moreover, note that

$$ \|\theta\|_1 = [1, 1, \ldots, 1] \begin{bmatrix} \theta_+ \\ (-\theta)_+ \end{bmatrix} = [1, 1, \ldots, 1] \begin{bmatrix} u \\ v \end{bmatrix}. $$

Hence, our $\ell_1$ minimization task can be recast in the LP form if

$$ c := [1, 1, \ldots, 1]^T, \quad x := [u^T, v^T]^T, \quad A := [X, -X], \quad b := y. $$

CHARACTERIZATION OF THE $\ell_1$ NORM MINIMIZER

Lemma 9.1. An element $\theta$ in the affine set $\Theta$ of the solutions of the underdetermined linear system
(9.17) has minimal $\ell_1$ norm if and only if the following condition is satisfied:

$$ \left| \sum_{i:\, \theta_i \neq 0} \operatorname{sgn}(\theta_i)\, z_i \right| \le \sum_{i:\, \theta_i = 0} |z_i|, \quad \forall z \in \operatorname{null}(X). \tag{9.22} $$

Moreover, the $\ell_1$ minimizer is unique if and only if the inequality in (9.22) is a strict one for all $z \neq 0$
(see, e.g., [74] and Problem 9.11).

Remarks 9.2.

• The previous lemma has a very interesting and important consequence. If $\hat{\theta}$ is the unique minimizer
of (9.21), then

$$ \operatorname{card}\{i : \hat{\theta}_i = 0\} \ge \dim(\operatorname{null}(X)), \tag{9.23} $$

where card$\{\cdot\}$ denotes the cardinality of a set. In words, the number of zero coordinates of the
unique minimizer cannot be smaller than the dimension of the null space of $X$. Indeed, assume that this is
not the case, so that the unique minimizer has fewer zeros than the dimensionality of null$(X)$.
This means that we can always find a $z \in \operatorname{null}(X)$, which has zeros in the same locations where the
coordinates of the unique minimizer are zero, and at the same time it is not identically zero, that is,
$z \neq 0$ (Problem 9.12). However, for such a $z$ the right-hand side of (9.22) is zero, so this would violate
(9.22), which in the case of uniqueness holds as a strict inequality.

Definition 9.1. A vector $\theta$ is called k-sparse if it has at most $k$ nonzero components.

Remarks 9.3.

• If the minimizer of (9.21) is unique, then it is a k-sparse vector with

$$ k \le N. $$

This is a direct consequence of Remarks 9.2, and the fact that for the matrix $X$,

$$ \dim(\operatorname{null}(X)) = l - \operatorname{rank}(X) = l - N. $$

Hence, the number of the nonzero elements of the unique minimizer must be at most equal to $N$. If
one resorts to geometry, all the previously stated results become crystal clear.

GEOMETRIC INTERPRETATION 

Assume that our target solution resides in the three-dimensional space and that we are given one
measurement,

$$ y_1 = x_1^T \theta = x_{11}\theta_1 + x_{12}\theta_2 + x_{13}\theta_3. $$

Then the solution lies in the two-dimensional (hyper)plane, which is described by the previous
equation. To get the minimal $\ell_1$ solution we keep increasing the size of the $\ell_1$ ball$^3$ (all the points that lie
on the sides/faces of this ball have equal $\ell_1$ norm) until it touches this plane. The only way that these
two geometric objects have a single point in common (unique solution) is when they meet at a vertex
of the diamond. This is shown in Fig. 9.8A. In other words, the resulting solution is 1-sparse, having
two of its components equal to zero. This complies with the finding stated in Remarks 9.3, because
now $N = 1$. For any other orientation of the plane, this will either cut across the $\ell_1$ ball or will share
with the diamond an edge or a side. In both cases, there will be infinitely many solutions.

Let us now assume that we are given an extra measurement,

$$ y_2 = x_{21}\theta_1 + x_{22}\theta_2 + x_{23}\theta_3. $$

The solution now lies in the intersection of the two previous planes, which is a straight line. However,
now, we have more alternatives for a unique solution. A line, for example, $\Theta_1$, can either touch the $\ell_1$
ball at a vertex (1-sparse solution) or, as shown in Fig. 9.8B, touch the $\ell_1$ ball at one of its edges, for
example, $\Theta_2$. The latter case corresponds to a solution that lies on a two-dimensional subspace; hence
it will be a 2-sparse vector. This also complies with the findings stated in Remarks 9.3, because in this
case we have $N = 2$, $l = 3$, and the sparsity level for a unique solution can be either 1 or 2.

3 Observe that in the three-dimensional space the $\ell_1$ ball looks like a diamond.

FIGURE 9.8

(A) The $\ell_1$ ball intersecting with a plane. The only possible scenario, for the existence of a unique common
intersecting point of the $\ell_1$ ball with a plane in the Euclidean $\mathbb{R}^3$ space, is for the point to be located at one of the
vertices of the $\ell_1$ ball, that is, to be a 1-sparse vector. (B) The $\ell_1$ ball intersecting with lines. In this case, the sparsity
level of the unique intersecting point is relaxed; it could be a 1- or a 2-sparse vector.

Note that uniqueness is associated with the particular geometry and orientation of the affine set,
which is the set of all possible solutions of the underdetermined system of equations. For the case of
the squared $\ell_2$ norm, the solution is always unique. This is a consequence of the (hyper)spherical shape
formed by the Euclidean norm. From a mathematical point of view, the squared $\ell_2$ norm is a strictly
convex function. This is not the case for the $\ell_1$ norm, which is convex, albeit not a strictly convex
function (Problem 9.13).

Example 9.2. Consider a sparse vector parameter $[0, 1]^T$, which we assume to be unknown. We will
use one measurement to sense it. Based on this single measurement, we will use the $\ell_1$ minimizer of
(9.21) to recover its true value. Let us see what happens.

We will consider three different values of the "sensing" (input) vector $x$ in order to obtain the
measurement $y = x^T\theta$: (a) $x = [\frac{1}{2}, 1]^T$, (b) $x = [1, 1]^T$, and (c) $x = [2, 1]^T$. The resulting measurement,
after sensing $\theta$ by $x$, is $y = 1$ for all three previous cases.

Case a: The solution will lie on the straight line

$$ \Theta = \left\{ [\theta_1, \theta_2]^T \in \mathbb{R}^2 : \tfrac{1}{2}\theta_1 + \theta_2 = 1 \right\}, $$

which is shown in Fig. 9.9A. For this setting, expanding the $\ell_1$ ball, this will touch the straight line (our
solution's affine set) at the vertex $[0, 1]^T$. This is a unique solution, hence it is sparse, and it coincides
with the true value.

Case b: The solution lies on the straight line

$$ \Theta = \left\{ [\theta_1, \theta_2]^T \in \mathbb{R}^2 : \theta_1 + \theta_2 = 1 \right\}, $$

which is shown in Fig. 9.9B. For this setup, there is an infinite number of solutions, including two
sparse ones.

Case c: The affine set of solutions is described by

$$ \Theta = \left\{ [\theta_1, \theta_2]^T \in \mathbb{R}^2 : 2\theta_1 + \theta_2 = 1 \right\}, $$

which is sketched in Fig. 9.9C. The solution in this case, $[\frac{1}{2}, 0]^T$, is sparse, but it is not the correct one.

This example is quite informative. If we sense (measure) our unknown parameter vector with
appropriate sensing (input) data, the use of the $\ell_1$ norm can unveil the true value of the parameter vector,
even if the system of equations is underdetermined, provided that the true parameter is sparse. This
becomes our new goal: to investigate whether what we have just said can be generalized, and under which
conditions it holds true. In such a case, the choice of the regressors (which we called sensing vectors)
and hence the input matrix (which we will refer to more and more frequently as the sensing matrix)
acquire extra significance. It is not enough for the designer to care only for the rank of the matrix, that
is, the linear independence of the sensing vectors. One has to make sure that the corresponding affine
set of the solutions has such an orientation so that the touch with the $\ell_1$ ball (as this increases from
zero to meet this plane) is a "gentle" one; that is, they meet at a single point, and more importantly, at
the correct one, which is the point that represents the true value of the sparse parameter vector.

FIGURE 9.9

(A) Sensing with $x = [\frac{1}{2}, 1]^T$. (B) Sensing with $x = [1, 1]^T$. (C) Sensing with $x = [2, 1]^T$. The choice of the sensing
vector $x$ is crucial to unveiling the true sparse solution (0, 1). Only the sensing vector $x = [\frac{1}{2}, 1]^T$ identifies uniquely
the desired (0, 1).
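The three sensing vectors of Example 9.2 can also be checked against condition (9.22) of Lemma 9.1. The sketch below is an illustration under the assumption of a single measurement in two dimensions, where null$(X)$ is spanned by one vector $z$; since both sides of (9.22) scale by $|a|$ when $z$ is replaced by $az$, testing a single $z$ suffices. The function names and labels are hypothetical.

```python
from fractions import Fraction as F

def sgn(t):
    """Sign of t as -1, 0, or +1."""
    return (t > 0) - (t < 0)

def lemma_check(theta, z):
    """Compare both sides of (9.22) for one null-space direction z.
    Returns 'strict', 'equal', or 'violated'."""
    lhs = abs(sum(sgn(t) * zi for t, zi in zip(theta, z) if t != 0))
    rhs = sum(abs(zi) for t, zi in zip(theta, z) if t == 0)
    return "strict" if lhs < rhs else "equal" if lhs == rhs else "violated"

theta_true = [F(0), F(1)]
sensing = {"a": [F(1, 2), F(1)], "b": [F(1), F(1)], "c": [F(2), F(1)]}

results = {}
for name, x in sensing.items():
    z = [x[1], -x[0]]          # x^T z = 0, so z spans null(X)
    results[name] = lemma_check(theta_true, z)
```

A strict inequality (case a) means that $[0, 1]^T$ is the unique $\ell_1$ minimizer, equality (case b) means that it is a minimizer but not the only one, and a violation (case c) means that it is not a minimizer at all, in agreement with the geometric discussion.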

Remarks 9.4. 

• Often in practice, the columns of the input matrix, $X$, are normalized to unit $\ell_2$ norm. Although the $\ell_0$
norm is insensitive to the values of the nonzero components of $\theta$, this is not the case with the $\ell_1$
and $\ell_2$ norms. Hence, while trying to minimize the respective norms and at the same time fulfill the
constraints, components that correspond to columns of $X$ with high energy (norm) are favored over
the rest. Hence, the latter become more popular candidates to be pushed to zero. In order to avoid
such situations, the columns of $X$ are normalized to unity by dividing each element of the column
vector by the respective (Euclidean) norm.
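A minimal sketch of this normalization step follows; the function name and the example matrix are assumptions. Note that normalizing the columns rescales the parameters: if $D$ is the diagonal matrix of the column norms, then $X\theta = (XD^{-1})(D\theta)$, so a solution obtained with the normalized matrix has to be divided componentwise by the norms to return to the original scale.

```python
import math

def normalize_columns(X):
    """Scale every column of X (a list of rows) to unit Euclidean norm.
    Returns the normalized matrix and the list of original column norms."""
    l = len(X[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(l)]
    if any(nrm == 0 for nrm in norms):
        raise ValueError("cannot normalize an all-zero column")
    Xn = [[row[j] / norms[j] for j in range(l)] for row in X]
    return Xn, norms

# Hypothetical 2 x 2 example: the column norms are 5 and 2.
Xn, norms = normalize_columns([[3.0, 0.0], [4.0, 2.0]])
```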


9.6 UNIQUENESS OF THE $\ell_0$ MINIMIZER

Our first goal is to derive sufficient conditions that guarantee uniqueness of the $\ell_0$ minimizer, which
has been defined in Section 9.5.

Definition 9.2. The spark of a full row rank $N \times l$ ($l \ge N$) matrix, $X$, denoted as spark$(X)$, is the
smallest number of its linearly dependent columns.

According to the previous definition, any $m <$ spark$(X)$ columns of $X$ are necessarily linearly
independent. The spark of a square, $N \times N$, full rank matrix is equal to $N + 1$.

Remarks 9.5. 


• In contrast to the rank of a matrix, which can be easily determined, its spark can only be obtained by
resorting to a combinatorial search over all possible combinations of the columns of the respective
matrix (see, e.g., [15,37]). The notion of the spark was used in the context of sparse representation,
under the name uniqueness representation property, in [53]. The name "spark" was coined in [37].
An interesting discussion relating this matrix index with indices used in other disciplines is given in
[15].

• Note that the notion of “spark” is related to the notion of the minimum Hamming weight of a linear 
code in coding theory (e.g., [60]). 


Example 9.3. Consider the following matrix:

$$ X = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}. $$

The matrix has rank equal to 4 and spark equal to 3. Indeed, any pair of columns is linearly independent.
On the other hand, the first, second, and fifth columns are linearly dependent. The same is also true
for the combination of the second, third, and sixth columns. Also, the maximum number of linearly
independent columns is four.
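The values claimed in Example 9.3 can be verified with a brute-force search, which also makes the combinatorial nature of the spark explicit. The following is a sketch for small matrices only; the function names are assumptions, and exact rational arithmetic is used so that the rank decisions are not affected by rounding.

```python
from fractions import Fraction as F
from itertools import combinations

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    M = [[F(v) for v in row] for row in rows]
    r = 0
    for c in range(len(M[0]) if M else 0):
        p = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def spark(X):
    """Smallest number m of linearly dependent columns, together with the
    first dependent subset found (zero-based column indices)."""
    l = len(X[0])
    for m in range(1, l + 1):
        for S in combinations(range(l), m):
            if rank([[row[j] for j in S] for row in X]) < m:
                return m, S
    return l + 1, None     # all columns independent (possible only if l <= N)

X = [[1, 0, 0, 0, 1, 0],
     [0, 1, 0, 0, 1, 1],
     [0, 0, 1, 0, 0, 1],
     [0, 0, 0, 1, 0, 0]]
m, subset = spark(X)
```

The search returns a spark of 3, with the first dependent triplet being columns one, two, and five (indices 0, 1, 4 in the code).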

Lemma 9.2. If null$(X)$ is the null space of $X$, then

$$ \|\theta\|_0 \ge \operatorname{spark}(X), \quad \forall \theta \in \operatorname{null}(X),\ \theta \neq 0. $$

Proof. To derive a contradiction, assume that there exists a $\theta \in \operatorname{null}(X)$, $\theta \neq 0$, such that $\|\theta\|_0 <$
spark$(X)$. Because by definition $X\theta = 0$, there exists a number of $\|\theta\|_0$ columns of $X$ that are
linearly dependent. However, this contradicts the minimality of spark$(X)$, and the claim of Lemma 9.2 is
established.

Lemma 9.3. If a linear system of equations, $X\theta = y$, has a solution that satisfies

$$ \|\theta\|_0 < \frac{1}{2}\operatorname{spark}(X), $$

then this is the sparsest possible solution. In other words, this is necessarily the unique solution of the
$\ell_0$ minimizer.

Proof. Consider any other solution $h \neq \theta$. Then $\theta - h \in \operatorname{null}(X)$, that is,

$$ X(\theta - h) = 0. $$

Thus, according to Lemma 9.2,

$$ \operatorname{spark}(X) \le \|\theta - h\|_0 \le \|\theta\|_0 + \|h\|_0. \tag{9.24} $$

Observe that although the $\ell_0$ "norm" is not a true norm, it can be readily verified by simple inspection
and reasoning that the triangular property is satisfied. Indeed, by adding two vectors together, the
resulting number of nonzero elements will always be at most equal to the total number of nonzero
elements of the two vectors. Therefore, if $\|\theta\|_0 < \frac{1}{2}\operatorname{spark}(X)$, then (9.24) suggests that

$$ \|h\|_0 > \frac{1}{2}\operatorname{spark}(X) > \|\theta\|_0. $$

Remarks 9.6. 

• Lemma 9.3 is a very interesting result. We have a sufficient condition to check whether a solution
is the unique optimal in a generally NP-hard problem. Of course, although this is nice from a
theoretical point of view, it is not of much use by itself, because the related bound (the spark) can
only be obtained after a combinatorial search. In the next section, we will see that we can relax the
bound by involving another index in place of the spark, which is easier to compute.

• An obvious consequence of the previous lemma is that if the unknown parameter vector is a sparse
one with $k$ nonzero elements, then if matrix $X$ is chosen in order to have spark$(X) > 2k$, the true
parameter vector is necessarily the sparsest one that satisfies the set of equations, and the (unique)
solution to the $\ell_0$ minimizer.

• In practice, the goal is to sense the unknown parameter vector by a matrix that has a spark as high
as possible, so that the previously stated sufficiency condition covers a wide range of cases. For
example, if the spark of the input matrix is equal to three, then one can check for optimal sparse
solutions up to a sparsity level of $k = 1$. From the respective definition, it is easily seen that the
values of the spark are in the range $1 \le \operatorname{spark}(X) \le N + 1$.

• Constructing an $N \times l$ matrix $X$ in a random manner, by generating i.i.d. entries, guarantees with
high probability that spark$(X) = N + 1$; that is, any $N$ columns of the matrix are linearly
independent.

9.6.1 MUTUAL COHERENCE 

Because the spark of a matrix is a number that is difficult to compute, our interest shifts to another
index, which can be derived more easily and at the same time offers a useful bound on the spark. The
mutual coherence of an $N \times l$ matrix $X$ [65], denoted as $\mu(X)$, is defined as

$$ \mu(X) := \max_{1 \le i < j \le l} \frac{\left| x_i^{cT} x_j^c \right|}{\left\| x_i^c \right\| \left\| x_j^c \right\|}, \tag{9.25} $$

where $x_i^c$, $i = 1, 2, \ldots, l$, denote the columns of $X$ (note the difference in notation between a row $x_i^T$
and a column $x_i^c$ of the matrix $X$). This number reminds us of the correlation coefficient between two
random variables. Mutual coherence is bounded as $0 \le \mu(X) \le 1$. For a square orthogonal matrix $X$,
$\mu(X) = 0$. For general matrices with $l > N$, $\mu(X)$ satisfies

$$ \sqrt{\frac{l - N}{N(l - 1)}} \le \mu(X) \le 1, $$

which is known as the Welch bound [98] (Problem 9.15). For large values of $l$, the lower bound becomes
approximately $\mu(X) \ge \frac{1}{\sqrt{N}}$. Common sense reasoning guides us to construct input (sensing) matrices
of mutual coherence as small as possible. Indeed, the purpose of the sensing matrix is to "measure" the
components of the unknown vector and "store" this information in the measurement vector $y$. Thus, this
should be done in such a way that $y$ retains as much information about the components of $\theta$ as possible.
This can be achieved if the columns of the sensing matrix $X$ are as "independent" as possible. Indeed,
$y$ is the result of a combination of the columns of $X$, each one weighted by a different component of $\theta$.
Thus, if the columns are as "independent" as possible, then the information regarding each component
of $\theta$ is contributed by a different direction, making its recovery easier. This is easier understood if $X$
is a square orthogonal matrix. In the more general case of a nonsquare matrix, the columns should be
made as "orthogonal" as possible.

Example 9.4. Assume that $X$ is an $N \times 2N$ matrix, formed by concatenating two orthonormal bases
together,

$$ X = [I, W], $$

where $I$ is the identity matrix, having as columns the vectors $e_i$, $i = 1, 2, \ldots, N$, with elements equal
to

$$ \delta_{ir} = \begin{cases} 1, & \text{if } i = r, \\ 0, & \text{if } i \neq r, \end{cases} $$

for $r = 1, 2, \ldots, N$. The matrix $W$ is the orthonormal DFT matrix, defined as

$$ W = \frac{1}{\sqrt{N}} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & W_N & \cdots & W_N^{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & W_N^{N-1} & \cdots & W_N^{(N-1)(N-1)} \end{bmatrix}, $$

where

$$ W_N := \exp\left( -j\frac{2\pi}{N} \right). $$

For example, such an overcomplete dictionary could be used in the expansion (9.16) to represent signal
vectors, which comprise the sum of sinusoids with spiky-like signal pulses. The inner products between
any two columns of $I$ and between any two columns of $W$ are zero, due to orthogonality. On the other
hand, it is easy to see that the inner product between any column of $I$ and any column of $W$ has absolute
value equal to $\frac{1}{\sqrt{N}}$. Hence, the mutual coherence of this matrix is $\mu(X) = \frac{1}{\sqrt{N}}$. Moreover, observe that
the spark of this matrix attains the maximum possible value, spark$(X) = N + 1$, when $N$ is a prime
number; for composite $N$ the spark is smaller.
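The mutual coherence of the dictionary in Example 9.4 can be computed directly from (9.25). The sketch below is an illustration with assumed function names; because the DFT columns are complex, the inner product is taken with complex conjugation, which is an assumption that extends the real-valued definition (9.25) in the usual way.

```python
import cmath
import math
from itertools import combinations

def mutual_coherence(cols):
    """Largest normalized |inner product| between two distinct columns."""
    def inner(a, b):
        return sum(ai.conjugate() * bi for ai, bi in zip(a, b))
    def norm(a):
        return math.sqrt(inner(a, a).real)
    return max(abs(inner(a, b)) / (norm(a) * norm(b))
               for a, b in combinations(cols, 2))

def identity_dft_columns(N):
    """Columns of X = [I, W], with W the orthonormal N x N DFT matrix."""
    eye = [[1.0 if i == r else 0.0 for i in range(N)] for r in range(N)]
    w = cmath.exp(-2j * cmath.pi / N)                 # W_N
    dft = [[w ** (i * k) / math.sqrt(N) for i in range(N)] for k in range(N)]
    return eye + dft

N = 8
mu = mutual_coherence(identity_dft_columns(N))
welch = math.sqrt((2 * N - N) / (N * (2 * N - 1)))    # lower bound with l = 2N
```

For $N = 8$ the computed value agrees with $1/\sqrt{N}$ up to rounding and lies above the Welch lower bound, as expected.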


Lemma 9.4. For any $N \times l$ matrix $X$, the following inequality holds:

$$ \operatorname{spark}(X) \ge 1 + \frac{1}{\mu(X)}. \tag{9.26} $$

The proof is given in [37] and it is based on arguments that stem from matrix theory applied on the
Gram matrix, $X^T X$, of $X$ (Problem 9.16). A "superficial" look at the previous bound is that for very
small values of $\mu(X)$ the spark can be larger than $N + 1$! Looking at the proof, it is seen that in such
cases the spark of the matrix attains its maximum value $N + 1$.

The result complies with common sense reasoning. The smaller the value of $\mu(X)$, the more
independent the columns of $X$, hence the higher the value of its spark is expected to be. Combining
Lemma 9.3 and (9.26), we come to the following important theorem, first given in [37].


Theorem 9.1. If the linear system of equations in (9.17) has a solution that satisfies the condition

$$ \|\theta\|_0 < \frac{1}{2}\left( 1 + \frac{1}{\mu(X)} \right), \tag{9.27} $$

then this solution is the sparsest one.

Remarks 9.7.

• The bound in (9.27) is "psychologically" important. It provides an easily computed bound for checking
whether the solution to an NP-hard task is the optimal one. However, it is not a particularly good
bound and it restricts the range of values in which it can be applied. As it was discussed in
Example 9.4, while the maximum possible value of the spark of a matrix was equal to $N + 1$, the minimum
possible value of the mutual coherence was $\frac{1}{\sqrt{N}}$. Therefore, the bound based on the mutual
coherence restricts the range of sparsity, that is, $\|\theta\|_0$, where one can check optimality, to around $\frac{1}{2}\sqrt{N}$.
Moreover, as the previously stated Welch bound suggests, this $O\left(\frac{1}{\sqrt{N}}\right)$ dependence of the mutual
coherence seems to be a more general trend and not only the case for Example 9.4 (see, e.g., [36]).
On the other hand, as we have already stated in Remarks 9.6, one can construct random matrices
with spark equal to $N + 1$; hence, using the bound based on the spark, one could expand the range
of sparse vectors up to $\frac{1}{2}N$.
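The gap between the two sufficiency conditions can be made concrete with a few lines of arithmetic. In the sketch below, the value $N = 64$ and the function names are assumptions made for illustration; the coherence is taken at the value $1/\sqrt{N}$ of Example 9.4 and the spark at its maximum value $N + 1$.

```python
import math

def coherence_threshold(mu):
    """Right-hand side of (9.27): (1/2) * (1 + 1/mu)."""
    return 0.5 * (1.0 + 1.0 / mu)

def spark_threshold(spark):
    """Right-hand side of the condition in Lemma 9.3: spark / 2."""
    return 0.5 * spark

def max_certified_sparsity(threshold):
    """Largest integer k that satisfies the strict inequality k < threshold."""
    return math.ceil(threshold) - 1

N = 64                                     # assumed number of measurements
k_coherence = max_certified_sparsity(coherence_threshold(1 / math.sqrt(N)))
k_spark = max_certified_sparsity(spark_threshold(N + 1))
```

With these assumed values the coherence-based test certifies sparsity levels up to 4, while the spark-based test reaches 32, in line with the $\frac{1}{2}\sqrt{N}$ versus $\frac{1}{2}N$ scaling discussed in Remarks 9.7.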


9.7 EQUIVALENCE OF $\ell_0$ AND $\ell_1$ MINIMIZERS: SUFFICIENCY CONDITIONS


We have now come to the crucial point where we will establish the conditions that guarantee the
equivalence between the $\ell_1$ and $\ell_0$ minimizers. Hence, under such conditions, a problem that is in
general an NP-hard one can be solved via a tractable convex optimization task. Under these conditions,
the zero value encouraging nature of the $\ell_1$ norm, which has already been discussed, obtains a much
higher stature; it provides the sparsest solution.


9.7.1 CONDITION IMPLIED BY THE MUTUAL COHERENCE NUMBER 

Theorem 9.2. Consider the underdetermined system of equations

$$ y = X\theta, $$

where $X$ is an $N \times l$ ($N < l$) full row rank matrix. If a solution exists and satisfies the condition

$$ \|\theta\|_0 < \frac{1}{2}\left( 1 + \frac{1}{\mu(X)} \right), \tag{9.28} $$

then this is the unique solution of both the $\ell_0$ and $\ell_1$ minimizers.

This is a very important theorem, and it was shown independently in [37,54]. Earlier versions of the
theorem addressed the special case of a dictionary comprising two orthonormal bases [36,48]. A proof
is also summarized in [15] (Problem 9.17). This theorem established, for the first time, what was until
then empirically known: often, the $\ell_1$ and $\ell_0$ minimizers result in the same solution.

Remarks 9.8. 

• The theory that we have presented so far is very satisfying, because it offers the theoretical
framework and conditions that guarantee uniqueness of a sparse solution to an underdetermined system of
equations. Now we know that under certain conditions, the solution which we obtain by solving the
convex $\ell_1$ minimization task is the (unique) sparsest one. However, from a practical point of view,
the theory, which is based on mutual coherence, does not tell the whole story and falls short in
predicting what happens in practice. Experimental evidence suggests that the range of sparsity levels,
for which the $\ell_0$ and $\ell_1$ tasks give the same solution, is much wider than the range guaranteed by
the mutual coherence bound. Hence, there is a lot of theoretical activity aimed at improving this
bound. A detailed discussion is beyond the scope of this book. In the next section, we will present
one of these bounds, because it is the one that currently dominates the scene. For more details and
a related discussion, the interested reader may consult, for example, [39,49,50].

9.7.2 THE RESTRICTED ISOMETRY PROPERTY (RIP) 

Definition 9.3. For each integer $k = 1, 2, \ldots$, define the isometry constant $\delta_k$ of an $N \times l$ matrix $X$ as
the smallest number such that

$$ (1 - \delta_k)\|\theta\|_2^2 \le \|X\theta\|_2^2 \le (1 + \delta_k)\|\theta\|_2^2 : \text{ the RIP condition} \tag{9.29} $$

holds true for all k-sparse vectors $\theta$.

This definition was introduced in [19]. We loosely say that matrix $X$ obeys the RIP of order $k$ if
$\delta_k$ is not too close to one. When this property holds true, it implies that the Euclidean norm of $\theta$ is
approximately preserved after projecting it onto the rows of $X$. Obviously, if matrix $X$ was
orthonormal, we would have $\delta_k = 0$. Of course, because we are dealing with nonsquare matrices, this is not
possible. However, the closer $\delta_k$ is to zero, the closer to orthonormal all subsets of $k$ columns of $X$ are.
Another viewpoint of (9.29) is that $X$ preserves Euclidean distances between k-sparse vectors. Let us
consider two k-sparse vectors, $\theta_1$, $\theta_2$, and apply (9.29) to their difference, $\theta_1 - \theta_2$, which, in general,
is a 2k-sparse vector. Then we obtain

$$ (1 - \delta_{2k})\|\theta_1 - \theta_2\|_2^2 \le \|X(\theta_1 - \theta_2)\|_2^2 \le (1 + \delta_{2k})\|\theta_1 - \theta_2\|_2^2. \tag{9.30} $$

Thus, when $\delta_{2k}$ is small enough, the Euclidean distance is preserved after projection in the
lower-dimensional observations' space. In words, if the RIP holds true, this means that searching for a sparse
vector in the lower-dimensional subspace, $\mathbb{R}^N$, formed by the observations, and not in the original
$l$-dimensional space, one can still recover the vector since distances are preserved and the target vector
is not "confused" with others. After projection onto the rows of $X$, the discriminatory power of the
method is retained. It is interesting to point out that the RIP is also related to the condition number
of the Grammian matrix. In [6,19], it is pointed out that if $X_r$ denotes the matrix that results by
considering only $r$ of the columns of $X$, then the RIP in (9.29) is equivalent to requiring the respective
Grammian, $X_r^T X_r$, $r \le k$, to have its eigenvalues within the interval $[1 - \delta_k, 1 + \delta_k]$. Hence, the more
well-conditioned the matrix is, the better we dig out the information hidden in the lower-dimensional
space.
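The eigenvalue viewpoint suggests a direct, if exhaustive, way of evaluating the isometry constant for very small $k$. The sketch below computes $\delta_2$ for a real matrix by scanning all pairs of columns and using the closed-form eigenvalues of the $2 \times 2$ Grammian; the function name and the example dictionary are assumptions. For general $k$, the number of column subsets to examine grows combinatorially, so this is only an illustration.

```python
import math
from itertools import combinations

def delta_2(cols):
    """Isometry constant of order 2 for real columns, by exhaustive search.

    For every pair of columns, the eigenvalues of the 2 x 2 Gram matrix
    [[a, b], [b, c]] are (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2); delta_2 is
    the largest deviation of any such eigenvalue from one. Vectors with a
    single nonzero entry are covered as special cases of each pair.
    """
    def inner(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    worst = 0.0
    for u, v in combinations(cols, 2):
        a, b, c = inner(u, u), inner(u, v), inner(v, v)
        mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
        worst = max(worst, abs(mid + rad - 1), abs(mid - rad - 1))
    return worst

# Hypothetical 2 x 3 dictionary with unit-norm columns, given column by column.
s = 1 / math.sqrt(2)
d2 = delta_2([[1.0, 0.0], [0.0, 1.0], [s, s]])
```

When all columns have unit norm, each Grammian has eigenvalues $1 \pm |b|$, so $\delta_2$ coincides with the mutual coherence; here both are equal to $1/\sqrt{2}$.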

Theorem 9.3. Assume that for some $k$, $\delta_{2k} < \sqrt{2} - 1$. Then the solution to the $\ell_1$ minimizer of (9.21),
denoted as $\theta_*$, satisfies the following two conditions:

$$ \|\theta - \theta_*\|_1 \le C_0 \|\theta - \theta_k\|_1 \tag{9.31} $$

and

$$ \|\theta - \theta_*\|_2 \le C_0 k^{-\frac{1}{2}} \|\theta - \theta_k\|_1, \tag{9.32} $$

for some constant $C_0$. In the previously stated formulas, $\theta$ is the true (target) vector that generates the
observations in (9.21) and $\theta_k$ is the vector that results from $\theta$ if we keep its $k$ largest components and
set the rest equal to zero [18,19,22,23].

Hence, if the true vector is a sparse one, that is, $\theta = \theta_k$, then the $\ell_1$ minimizer
recovers the (unique) exact value. On the other hand, if the true vector is not a sparse one, then the
minimizer results in a solution whose accuracy is dictated by a genie-aided procedure that knew in
advance the locations of the $k$ largest components of $\theta$. This is a groundbreaking result.
Moreover, it is deterministic; it is always true, not with high probability. Note that the isometry
property of order $2k$ is used, because at the heart of the method lies our desire to preserve the norm
of the differences between vectors.

Let us now focus on the case where there is a $k$-sparse vector that generates the observations, that
is, $\theta = \theta_k$. Then it is shown in [18] that the condition $\delta_{2k} < 1$ guarantees that
the $\ell_0$ minimizer has a unique $k$-sparse solution. In other words, in order to get the equivalence
between the $\ell_1$ and $\ell_0$ minimizers, the range of values for $\delta_{2k}$ has to be decreased
to $\delta_{2k} < \sqrt{2} - 1$, according to Theorem 9.3. This sounds reasonable. If we relax the
criterion and use $\ell_1$ instead of $\ell_0$, then the sensing matrix has to be more carefully
constructed. Although we will not provide the proofs of these theorems here, because their formulation
is well beyond the scope of this book, it is interesting to follow what happens if $\delta_{2k} = 1$.
This will give us a flavor of the essence behind the proofs. If $\delta_{2k} = 1$, the left-hand side
term in (9.30) becomes zero. In this case, there may exist two $k$-sparse vectors $\theta_1$, $\theta_2$
such that $X(\theta_1 - \theta_2) = 0$, or $X\theta_1 = X\theta_2$. Thus, it is not possible to recover
all $k$-sparse vectors, after projecting them in the observations space, by any method.

The previous argument also establishes a connection between RIP and the spark of a matrix. Indeed, if
$\delta_{2k} < 1$, this guarantees that any number of columns of $X$ up to $2k$ are linearly
independent, because for any $2k$-sparse $\theta$, (9.29) guarantees that $\|X\theta\|_2 > 0$. This
implies that $\mathrm{spark}(X) > 2k$. A connection between RIP and the coherence is established in
[16], where it is shown that if $X$ has coherence $\mu(X)$, and unit norm columns, then $X$ satisfies
the RIP of order $k$ with $\delta_k$, where $\delta_k \le (k - 1)\mu(X)$.
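
The coherence bound can be verified on a toy matrix. The sketch below assumes a small random matrix with unit-norm columns and compares a brute-force value of the isometry constant with $(k-1)\mu(X)$. Both helper names are hypothetical, and the enumeration is only practical for tiny sizes.

```python
from itertools import combinations

import numpy as np


def mutual_coherence(X):
    """Largest absolute normalized inner product between distinct columns."""
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)
    return G.max()


def rip_constant(X, k):
    """Brute-force delta_k from the Gram eigenvalues of every k-column subset."""
    worst = 0.0
    for S in combinations(range(X.shape[1]), k):
        cols = X[:, list(S)]
        eig = np.linalg.eigvalsh(cols.T @ cols)
        worst = max(worst, 1.0 - eig[0], eig[-1] - 1.0)
    return worst


rng = np.random.default_rng(1)
X = rng.normal(size=(15, 20))          # sizes are illustrative assumptions
X /= np.linalg.norm(X, axis=0)         # unit-norm columns, as the bound requires
mu = mutual_coherence(X)
for k in (2, 3):
    # left: isometry constant, right: coherence-based upper bound
    print(k, rip_constant(X, k), (k - 1) * mu)
```

For $k = 2$ the two numbers coincide up to rounding, because the Grammian of two unit-norm columns has eigenvalues $1 \pm |g|$, with $g$ their inner product. For larger $k$ the bound is usually loose.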

Constructing Matrices That Obey the RIP of Order k 

It is apparent from our previous discussion that the higher the value of $k$ for which the RIP of a
matrix $X$ holds true, the better, since a larger range of sparsity levels can be handled. Hence, a main
goal toward this direction is to construct such matrices. It turns out that verifying the RIP for a
matrix of a general structure is a difficult task. This reminds us of the spark of the matrix, which is
also difficult to compute. However, it turns out that for a certain class of random matrices, the RIP
can be established in an affordable way. Thus, constructing such sensing matrices has dominated the
scene of related research. We will present a few examples of such matrices, which are also very popular
in practice, without going into details of the proofs, because this is beyond the scope of this book.
The interested reader may find this information in the related references.

Perhaps the most well-known example of a random matrix is the Gaussian one, where the entries
$X(i, j)$ of the sensing matrix are i.i.d. realizations from a Gaussian PDF $\mathcal{N}(0, 1/N)$.
Another popular example of such matrices is constructed by sampling i.i.d. entries from Bernoulli, or
related, distributions

$$X(i,j) = \begin{cases} \dfrac{1}{\sqrt{N}}, & \text{with probability } \dfrac{1}{2}, \\[2mm] -\dfrac{1}{\sqrt{N}}, & \text{with probability } \dfrac{1}{2}, \end{cases}$$

or

$$X(i,j) = \begin{cases} +\sqrt{\dfrac{3}{N}}, & \text{with probability } \dfrac{1}{6}, \\[2mm] 0, & \text{with probability } \dfrac{2}{3}, \\[2mm] -\sqrt{\dfrac{3}{N}}, & \text{with probability } \dfrac{1}{6}. \end{cases}$$


Finally, one can adopt the uniform distribution and construct the columns of $X$ by sampling uniformly
at random on the unit sphere in $\mathbb{R}^N$. It turns out that such matrices obey the RIP of order
$k$ with overwhelming probability, provided that the number of observations, $N$, satisfies the
inequality

$$N \ge C k \ln(l/k), \tag{9.33}$$

where $C$ is some constant which depends on the isometry constant $\delta_k$. In words, having such a
matrix at our disposal, one can recover a $k$-sparse vector from $N < l$ observations, where $N$ is
larger than the sparsity level by an amount controlled by inequality (9.33). More on these issues can be
obtained from, for example, [6,67].
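
The random constructions mentioned above are easy to generate. The following sketch draws one matrix of each type with NumPy. The function names are hypothetical, and the value $\sqrt{3/N}$ used for the three-valued variant is an assumption, chosen so that every entry has variance $1/N$, like the Gaussian and the two-valued cases.

```python
import numpy as np


def gaussian_matrix(N, l, rng):
    """i.i.d. entries from N(0, 1/N)."""
    return rng.normal(0.0, np.sqrt(1.0 / N), size=(N, l))


def bernoulli_matrix(N, l, rng):
    """Entries +1/sqrt(N) or -1/sqrt(N), each with probability 1/2."""
    return rng.choice([1.0, -1.0], size=(N, l)) / np.sqrt(N)


def ternary_matrix(N, l, rng):
    """Entries +sqrt(3/N), 0, -sqrt(3/N) with probabilities 1/6, 2/3, 1/6 (assumed scaling)."""
    signs = rng.choice([1.0, 0.0, -1.0], size=(N, l), p=[1 / 6, 2 / 3, 1 / 6])
    return signs * np.sqrt(3.0 / N)


def sphere_matrix(N, l, rng):
    """Columns drawn uniformly on the unit sphere of R^N (normalized Gaussian vectors)."""
    G = rng.normal(size=(N, l))
    return G / np.linalg.norm(G, axis=0)


rng = np.random.default_rng(0)
N, l = 64, 256
for make in (gaussian_matrix, bernoulli_matrix, ternary_matrix, sphere_matrix):
    X = make(N, l, rng)
    # Average squared column norm; close to 1 for all four constructions
    print(make.__name__, np.mean(np.sum(X ** 2, axis=0)))
```

In all four cases the columns have (approximately or exactly) unit Euclidean norm, which is the normalization under which the isometry constant is usually stated.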

Besides random matrices, one can construct other matrices that obey the RIP. One such example includes
the partial Fourier matrices, which are formed by selecting uniformly at random $N$ rows drawn from the
$l \times l$ DFT matrix. Although the required number of samples for the RIP to be satisfied may be
larger than the bound in (9.33) (see [79]), Fourier-based sensing matrices offer certain computational
advantages when it comes to storage ($\mathcal{O}(N \ln l)$) and matrix-vector products
($\mathcal{O}(l \ln l)$) [20]. In [56], the case of random Toeplitz sensing matrices containing
statistical dependencies across rows is considered, and it is shown that they can also satisfy the RIP
with high probability. This is of particular importance in signal processing and communications
applications, where it is very common for a system to be excited in its input via a time series, and
hence independence between successive input rows cannot be assumed. In [44,76], the case of separable
matrices is considered where the sensing matrix is the result of a Kronecker product of matrices, which
satisfy the RIP individually. Such matrices are of interest for multidimensional signals, in order to
exploit the sparsity structure along each one of the involved dimensions. For example, such signals may
occur while trying to “encode” information associated with an event whose activity spreads across the
temporal, spectral, spatial, and other domains.

In spite of their theoretical elegance, the derived bounds that determine the number of the required
observations for certain sparsity levels fall short of the experimental evidence (e.g., [39]). In
practice, a rule of thumb is to use $N$ of the order of $3k$ to $5k$ [18]. For large values of $l$,
compared to the sparsity level, the analysis in [38] suggests that we can recover most sparse signals
when $N \approx 2k \ln(l/N)$. In an effort to overcome the shortcomings associated with the RIP, a
number of other techniques have been proposed (e.g., [11,30,39,85]). Furthermore, in specific
applications, the use of an empirical study may be a more appropriate path.
Note that, in principle, the minimum number of observations that are required to recover a $k$-sparse
vector from $N < l$ observations is $N \ge 2k$. Indeed, in the spirit of the discussion after Theorem
9.3, the main requirement that a sensing matrix must fulfill is not to map two different $k$-sparse
vectors to the same measurement vector $y$. Otherwise, one can never recover both vectors from their
(common) observations. If we have $2k$ observations and a sensing matrix that guarantees that any $2k$
columns are linearly independent, then the previously stated requirement is satisfied. However, the
bounds on the number of observations set in order for the respective matrices to satisfy the RIP are
larger. This is because RIP accounts also for the stability of the recovery process. We will come to
this issue in Section 9.9, where we talk about stable embeddings.
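
As an illustration of the $N = 2k$ argument, the sketch below builds a matrix with $2k$ rows for which any $2k$ columns are linearly independent and verifies this by enumeration. The choice of a Vandermonde matrix with distinct nodes is an assumption made for the example; it is not a construction taken from the text, and such matrices tend to be poorly conditioned for larger sizes, which is where the stability argument enters.

```python
from itertools import combinations

import numpy as np

k, l = 2, 8
N = 2 * k
nodes = np.linspace(-1.0, 1.0, l)               # distinct nodes (assumed)
X = np.vander(nodes, N, increasing=True).T      # N x l; column j is (1, t_j, ..., t_j^(N-1))

# Any 2k columns form a square Vandermonde matrix with distinct nodes, hence full rank.
all_independent = all(
    np.linalg.matrix_rank(X[:, list(S)]) == 2 * k
    for S in combinations(range(l), 2 * k)
)
print(all_independent)

# Consequence: two different k-sparse vectors cannot share the same observations.
theta_1 = np.zeros(l)
theta_1[[0, 3]] = [1.0, -2.0]
theta_2 = np.zeros(l)
theta_2[[1, 6]] = [0.5, 1.5]
print(np.linalg.norm(X @ (theta_1 - theta_2)))  # strictly positive
```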


9.8 ROBUST SPARSE SIGNAL RECOVERY FROM NOISY MEASUREMENTS 

In the previous section, our focus was on recovering a sparse solution from an underdetermined system
of equations. In the formulation of the problem, we assumed that there is no noise in the obtained
observations. Having acquired some experience and insight from a simpler scenario, we now turn our
attention to the more realistic task, where uncertainties come into the scene. One type of uncertainty
may be due to the presence of noise, and our observations' model comes back to the standard regression
form

$$y = X\theta + \eta, \tag{9.34}$$

where $X$ is our familiar nonsquare $N \times l$ matrix. A sparsity-aware formulation for recovering
$\theta$ from (9.34) can be cast as

$$\min_{\theta \in \mathbb{R}^l} \|\theta\|_1, \quad \text{s.t.} \quad \|y - X\theta\|_2^2 \le \epsilon, \tag{9.35}$$

which coincides with the LASSO task given in (9.8). Such a formulation implicitly assumes that the
noise is bounded and the respective range of values is controlled by $\epsilon$. One can consider a
number of different variants. For example, one possibility would be to minimize the $\|\cdot\|_0$ norm
instead of the $\|\cdot\|_1$, albeit losing the computational elegance of the latter. An alternative
route would be to replace the Euclidean norm in the constraints with another one.

Besides the presence of noise, one could see the previous formulation from a different perspective.
The unknown parameter vector, $\theta$, may not be exactly sparse, but it may consist of a few large
components, while the rest are small and close to, yet not necessarily equal to, zero. Such a model
misfit can be accommodated by allowing a deviation of $y$ from $X\theta$.

In this relaxed setting of a sparse solution recovery, the notions of uniqueness and equivalence
concerning the $\ell_0$ and $\ell_1$ solutions no longer apply. Instead, the issue that now gains
importance is that of stability of the solution. To this end, we focus on the computationally
attractive $\ell_1$ task. The counterpart of Theorem 9.3 is now expressed as follows.

Theorem 9.4. Assume that the sensing matrix, $X$, obeys the RIP with $\delta_{2k} < \sqrt{2} - 1$, for
some $k$. Then the solution $\theta_*$ of (9.35) satisfies the following ([22,23]):

$$\|\theta - \theta_*\|_2 \le C_0 k^{-1/2} \|\theta - \theta_k\|_1 + C_1 \sqrt{\epsilon}, \tag{9.36}$$

for some constants $C_1$, $C_0$, and $\theta_k$ as defined in Theorem 9.3.

This is also an elegant result. If the model is exact and $\epsilon = 0$, we obtain (9.32). If not, the
higher the uncertainty (noise) term in the model, the higher our ambiguity about the solution. Note,
also, that the ambiguity about the solution depends on how far the true model is from $\theta_k$. If
the true model is $k$-sparse, the first term on the right-hand side of the inequality is zero. The
values of $C_1$, $C_0$ depend on $\delta_{2k}$ but they are small, for example, close to five or six
[23].

The important conclusion here is that adopting the $\ell_1$ norm and the associated LASSO optimization
for solving inverse problems (which in general, as we noted in Chapter 3, tend to be ill-conditioned)
is a stable approach, and the noise is not amplified excessively during the recovery process.
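
The stability property can be observed empirically. The sketch below does not solve the constrained task (9.35) directly; instead it minimizes the penalized (Lagrangian) form $\frac{1}{2}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$ with the iterative soft thresholding algorithm (ISTA), a swapped-in solver chosen for brevity. The value of $\lambda$, the problem sizes, and the noise levels are all assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Entrywise shrinkage toward zero by the amount t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(X, y, theta, lam):
    return 0.5 * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))


def ista(X, y, lam, n_iter=2000):
    """Iterative soft thresholding for 0.5*||y - X theta||_2^2 + lam*||theta||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (X @ theta - y)
        theta = soft_threshold(theta - step * gradient, step * lam)
    return theta


rng = np.random.default_rng(0)
N, l, k = 40, 100, 5
X = rng.normal(scale=1.0 / np.sqrt(N), size=(N, l))
theta_true = np.zeros(l)
theta_true[rng.choice(l, size=k, replace=False)] = rng.normal(size=k)

for sigma in (0.0, 0.05, 0.2):                  # increasing noise standard deviation
    y = X @ theta_true + sigma * rng.normal(size=N)
    theta_hat = ista(X, y, lam=0.02)
    print(sigma, np.linalg.norm(theta_hat - theta_true))
```

One would expect the estimation error to grow gracefully with the noise level rather than to blow up, in the spirit of (9.36). The exact numbers depend on the random draw and on $\lambda$.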


9.9 COMPRESSED SENSING: THE GLORY OF RANDOMNESS 

The way in which this chapter was deployed followed, more or less, the sequence of developments that
took place during the evolution of the sparsity-aware parameter estimation field. We intentionally made
an effort to follow such a path, because this is also indicative of how science evolves in most cases.
The starting point had a rather strong mathematical flavor: to develop conditions for the solution of
an underdetermined linear system of equations, under the sparsity constraint and in a mathematically
tractable way, that is, using convex optimization. In the end, the accumulation of a sequence of
individual contributions revealed that the solution can be (uniquely) recovered if the unknown quantity
is sensed via randomly chosen data samples. This development has, in turn, given birth to a new field
with strong theoretical interest as well as with an enormous impact on practical applications. This
newly emerged area is known as compressed sensing or compressive sampling (CS), and it has changed our
view on how to sense and process signals efficiently.

COMPRESSED SENSING 

In CS, the goal is to directly acquire as few samples as possible that encode the minimum information
needed to obtain a compressed signal representation. In order to demonstrate this, let us return to
the data compression example discussed in Section 9.4. There, it was commented that the “classical”
approach to compression was rather unorthodox, in the sense that first all (i.e., a number of $l$)
samples of the signal are used, and then they are processed to obtain $l$ transformed values, from
which only a small subset is used for coding. In the CS setting, the procedure changes to the following
one.

Let $X$ be an $N \times l$ sensing matrix, which is applied to the (unknown) signal vector, $s$, in
order to obtain the observations, $y$, and let $\Psi$ be the dictionary matrix that describes the
domain where the signal $s$ accepts a sparse representation, that is,

$$s = \Psi\theta. \tag{9.37}$$
Assuming that at most $k$ of the components of $\theta$ are nonzero, this vector can be obtained by the
following optimization task:

$$\min_{\theta \in \mathbb{R}^l} \|\theta\|_1, \quad \text{s.t.} \quad y = X\Psi\theta, \tag{9.38}$$

provided that the combined matrix $X\Psi$ complies with the RIP, and the number of observations, $N$,
is large enough, as dictated by the bound in (9.33). Note that $s$ need not be stored and can be
obtained any time, once $\theta$ is known. Moreover, as we will soon discuss, there are techniques that
allow observations, $y_n$, $n = 1, 2, \ldots, N$, to be acquired directly from an analog signal $s(t)$,
prior to obtaining its sample (vector) version, $s$! Thus, from such a perspective, CS fuses the data
acquisition and the compression steps together.

There are different ways to obtain a sensing matrix, $X$, that lead to a product $X\Psi$, which
satisfies the RIP. It can be shown (Problem 9.19) that if $\Psi$ is orthonormal and $X$ is a random
matrix, which is constructed as discussed at the end of Section 9.7.2, then the product $X\Psi$ obeys
the RIP, provided that (9.33) is satisfied. An alternative way to obtain a combined matrix that
respects the RIP is to consider another orthonormal matrix $\Phi$, whose columns have low coherence
with the columns of $\Psi$ (coherence between two matrices is defined in (9.25), where now the place of
$x_i^c$ is taken by a column of $\Phi$ and that of $x_j^c$ by a column of $\Psi$). For example, $\Phi$
could be the DFT matrix and $\Psi = I$, or vice versa. Then choose $N$ rows of $\Phi$ uniformly at
random to form $X$ in (9.37). In other words, for such a case, the sensing matrix can be written as
$R\Phi$, where $R$ is an $N \times l$ matrix that extracts $N$ rows uniformly at random. The notion of
incoherence (low coherence) between the sensing and the basis matrices is closely related to RIP. The
more incoherent the two matrices, the lower the number of the required observations for the RIP to hold
(e.g., [21,79]). Another way to view incoherence is that the rows of $\Phi$ cannot be sparsely
represented in terms of the columns of $\Psi$. It turns out that if the sensing matrix $X$ is a random
one, formed as described in Section 9.7.2, then the RIP and the incoherence with any $\Psi$ are
satisfied with high probability.
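
The DFT/identity pair mentioned above can be examined numerically. The sketch below assumes a unitary normalization of the DFT matrix and small illustrative sizes; it computes the coherence between the two bases and then forms a sensing matrix by extracting $N$ rows at random. Variable names mirror the symbols of the text.

```python
import numpy as np

l, N = 64, 16                                   # illustrative sizes (assumed)
Phi = np.fft.fft(np.eye(l)) / np.sqrt(l)        # unitary l x l DFT matrix
Psi = np.eye(l)                                 # signal assumed sparse in the canonical basis

# Largest |inner product| between a column of Phi and a column of Psi.
mu = np.max(np.abs(Phi.conj().T @ Psi))
print(mu, 1.0 / np.sqrt(l))                     # the two values agree up to rounding

# Row-extraction matrix R (N x l) and the resulting sensing matrix X = R Phi.
rng = np.random.default_rng(0)
rows = np.sort(rng.choice(l, size=N, replace=False))
R = np.eye(l)[rows]
X = R @ Phi
A = X @ Psi                                     # combined matrix entering the l1 task
print(A.shape)
```

Every entry of the unitary DFT matrix has modulus $1/\sqrt{l}$, which is the smallest coherence that two orthonormal bases of this dimension can have. This is why the pair is often quoted as maximally incoherent.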

It gets even better when we say that all the previously stated philosophy can be extended to the more
general type of signals, which are not necessarily sparse or sparsely represented in terms of the atoms
of a dictionary, and they are known as compressible. A signal vector is said to be compressible if its
expansion in terms of a basis consists of just a few large parameters $\theta_i$ and the rest are
small. In other words, the signal vector is approximately sparse in some basis. Obviously, this is the
most interesting case in practice, where exact sparsity is scarcely (if ever) met. Reformulating the
arguments used in Section 9.8, the CS task for this case can be cast as

$$\min_{\theta \in \mathbb{R}^l} \|\theta\|_1, \quad \text{s.t.} \quad \|y - X\Psi\theta\|_2^2 \le \epsilon, \tag{9.39}$$

and everything that has been said in Section 9.8 is also valid for this case, if in place of $X$ we
consider the product $X\Psi$.

Remarks 9.9.

• An important property in CS is that the sensing matrix, which provides the observations, may be
chosen independently of the matrix $\Psi$, that is, the basis/dictionary in which the signal is sparsely
represented. In other words, the sensing matrix can be “universal” and can be used to provide the
observations for reconstructing any sparse or sparsely represented signal in any dictionary, provided
the RIP is not violated.

• Each observation, $y_n$, is the result of an inner product of the signal vector with a row, $x_n^T$,
of the sensing matrix $X$. Assuming that the signal vector $s$ is the result of a sampling process on an
analog signal, $s(t)$, $y_n$ can be directly obtained, to a good approximation, by taking the inner
product (integral) of $s(t)$ with a sensing waveform, $x_n(t)$, that corresponds to $x_n$. For example,
if $X$ is formed by $\pm 1$, as described in Section 9.7.2, then the configuration shown in Fig. 9.10
results in $y_n$. An important aspect of this approach, besides avoiding computing and storing the $l$
components of $s$, is that multiplying by $\pm 1$ is a relatively easy operation. It is equivalent to
changing the polarity of the signal and it can be implemented by employing inverters and mixers. It is
a process that can be performed, in practice, at much higher rates than sampling. The sampling system
shown in Fig. 9.10 is referred to as random demodulator [58,91]. It is one among the popular
analog-to-digital (A/D) conversion architectures, which exploit the CS rationale in order to sample at
rates much lower than those required for classical sampling. We will come back to this soon.

• One of the very first CS-based acquisition systems was an imaging system called the one-pixel
camera [84], which followed an approach resembling the conventional digital CS. According to this,
light of an image of interest is projected onto a random base generated by a micromirror device.
A sequence of projected images is collected by a single photodiode and used for the reconstruction
of the full image using conventional CS techniques. This was among the most catalytic examples
that spread the word about the practical power of CS. CS is an example of common wisdom: “There
is nothing more practical than a good theory!”

9.9.1 DIMENSIONALITY REDUCTION AND STABLE EMBEDDINGS

FIGURE 9.10

Sampling an analog signal $s(t)$ in order to generate the sample/observation $y_n$ at time instant $n$.
The sampling period $T_s$ is much longer than that required by Nyquist sampling; that is, the sampling
rate is much lower.

We will now shed light on what we have said so far in this chapter from a different point of view. In
both cases, either when the unknown quantity was a $k$-sparse vector in a high-dimensional space,
$\mathbb{R}^l$, or when the signal $s$ was (approximately) sparsely represented in some dictionary
($s = \Psi\theta$), we chose to work in a lower-dimensional space ($\mathbb{R}^N$), that is, the space
of the observations, $y$. This is a typical task of dimensionality reduction (see Chapter 19). The main
task in any (linear) dimensionality reduction technique is to choose the proper matrix $X$, which
dictates the projection to the lower-dimensional space. In general, there is always a loss of
information by projecting from $\mathbb{R}^l$ to $\mathbb{R}^N$, with $N < l$, in the sense that we
cannot recover an arbitrary vector $\theta_l \in \mathbb{R}^l$ from its projection
$\theta_N \in \mathbb{R}^N$. Indeed, take any vector $\theta_{l-N} \in \mathrm{null}(X)$ that lies in
the $(l - N)$-dimensional null space of the (full row rank) $X$ (see Section 9.5). Then all vectors
$\theta_l + \theta_{l-N}$ share the same projection in $\mathbb{R}^N$. However, what we have discovered
in this chapter is that if the original vector is sparse, then we can recover it exactly. This is
because all the $k$-sparse vectors do not lie anywhere in $\mathbb{R}^l$, but rather in a subset of it,
that is, in a union of subspaces, each one having dimensionality $k$. If the signal $s$ is sparse in
some dictionary $\Psi$, then one has to search for it in the union of all possible $k$-dimensional
subspaces of $\mathbb{R}^l$, which are spanned by $k$ column vectors from $\Psi$ [8,62]. Of course,
even in this case, where sparse vectors are involved, not every projection can guarantee unique
recovery. The guarantee is provided if the projection in the lower-dimensional space is a stable
embedding. A stable embedding in a lower-dimensional space must guarantee that if
$\theta_1 \ne \theta_2$, then their projections also remain different. Yet this is not enough. A stable
embedding must guarantee that distances are (approximately) preserved; that is, vectors that lie far
apart in the high-dimensional space have projections that also lie far apart. Such a property
guarantees robustness to noise. The sufficient conditions, which have been derived and discussed
throughout this chapter, and guarantee the recovery of a sparse vector lying in $\mathbb{R}^l$ from its
projections in $\mathbb{R}^N$, are conditions that guarantee stable embeddings. The RIP and the
associated bound on $N$ provide a condition on $X$ that leads to stable embeddings. We commented on
this norm preserving property of RIP in the related section. The interesting fact that came from the
theory is that we can achieve such stable embeddings via random projection matrices.
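
The distance-preserving behavior of a random projection can be probed with a short experiment. The sketch below is an empirical illustration under assumed sizes, not a proof: it draws a Gaussian projection matrix, generates random pairs of $k$-sparse vectors, and reports the ratio of squared distances after and before projection.

```python
import numpy as np


def random_k_sparse(l, k, rng):
    """A vector in R^l with k nonzero Gaussian entries at random positions."""
    theta = np.zeros(l)
    theta[rng.choice(l, size=k, replace=False)] = rng.normal(size=k)
    return theta


def distance_ratios(X, k, n_pairs, rng):
    """||X(theta_1 - theta_2)||^2 / ||theta_1 - theta_2||^2 for random k-sparse pairs."""
    l = X.shape[1]
    ratios = np.empty(n_pairs)
    for i in range(n_pairs):
        d = random_k_sparse(l, k, rng) - random_k_sparse(l, k, rng)
        ratios[i] = np.sum((X @ d) ** 2) / np.sum(d ** 2)
    return ratios


rng = np.random.default_rng(0)
N, l, k = 100, 1000, 5                          # illustrative sizes (assumed)
X = rng.normal(scale=1.0 / np.sqrt(N), size=(N, l))
ratios = distance_ratios(X, k, 200, rng)
print(ratios.min(), ratios.max())               # both expected to stay in a band around 1
```

Ratios concentrated around one correspond to the two-sided inequality (9.30). Note that sampling random pairs only probes typical behavior, whereas the RIP is a statement about the worst case over all sparse vectors.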

Random projections for dimensionality reduction are not new and have extensively been used in pat- 
tern recognition, clustering, and data mining (see, e.g., [1,13,34,82,87]). The advent of the big data era 
resparked the interest in random projection-aided data analysis algorithms (e.g., [55,81]) for two major 
reasons. The first is that data processing is computationally lighter in the lower-dimensional space, 
because it involves operations with matrices or vectors represented with fewer parameters. Moreover, 
the projection of the data to lower-dimensional spaces can be realized via well-structured matrices at 
a computational cost significantly lower compared to that implied by general matrix-vector multipli- 
cations [29,42]. The reduced computational power required by these methods renders them appealing 
when dealing with excessively large data volumes. The second reason is that there exist randomized 
algorithms which access the data matrix a (usually fixed) number of times that is much smaller than the 
number of accesses performed by ordinary methods [28,55]. This is very important whenever the full 
amount of data does not fit in fast memory and has to be accessed in parts from slow memory devices, 
such as hard discs. In such cases, the computational time is often dominated by the cost of memory 
access. 

The spirit underlying CS has been exploited in the context of pattern recognition too. In this appli- 
cation, one need not return to the original high-dimensional space after the information digging activity 
in the low-dimensional subspace. Since the focus in pattern recognition is to identify the class of an 
object/pattern, this can be performed in the observations subspace, provided that there is no class- 
related information loss. In [17], it is shown, using compressed sensing arguments, that if the data are 
approximately linearly separable in the original high-dimensional space and the data have a sparse rep- 
resentation, even in an unknown basis, then projecting randomly in the observations subspace retains 
the structure of linear separability. 

Manifold learning is another area where random projections have also been applied. A manifold is, in
general, a nonlinear $k$-dimensional surface, embedded in a higher-dimensional (ambient) space. For
example, the surface of a sphere is a two-dimensional manifold in a three-dimensional space. In [7,96],
the compressed sensing rationale is extended to signal vectors that live along a $k$-dimensional
submanifold of the space $\mathbb{R}^l$. It is shown that, by choosing a matrix $X$ to project and a
sufficient number $N$ of observations, the corresponding submanifold has a stable embedding in the
observations subspace, under the projection matrix $X$; that is, pair-wise Euclidean and geodesic
distances are approximately preserved after the projection mapping. More on these issues can be found
in the given references and in, for example, [8]. We will come to the manifold learning task in
Chapter 19.

9.9.2 SUB-NYQUIST SAMPLING: ANALOG-TO-INFORMATION CONVERSION 

In our discussion in the remarks presented before, we touched on a very important issue: that of going
from the analog domain to the discrete one. The topic of A/D conversion has been at the forefront
of research and technology since the seminal works of Shannon, Nyquist, Whittaker, and Kotelnikov
were published; see, for example, [92] for a thorough related review. We all know that if the highest
frequency of an analog signal $s(t)$ is less than $F/2$, then Shannon's theorem suggests that no
information is lost if the signal is sampled, at least, at the Nyquist rate of $F = 1/T$, where $T$ is
the corresponding sampling period, and the signal can be perfectly recovered from its samples

$$s(t) = \sum_n s(nT)\, \mathrm{sinc}(Ft - n),$$

where sinc is the sampling function

$$\mathrm{sinc}(t) = \frac{\sin(\pi t)}{\pi t}.$$
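
The interpolation formula can be tried out numerically. The sketch below assumes a single cosine tone and a finite block of samples, so the infinite sum is truncated; the sampling rate, the tone frequency, and the function name are illustrative choices. NumPy's `np.sinc` implements the normalized function $\sin(\pi t)/(\pi t)$ used above.

```python
import numpy as np


def sinc_reconstruct(samples, T, t):
    """Evaluate sum_n s(nT) sinc(t/T - n) at the times t, with n = 0, ..., len(samples) - 1."""
    n = np.arange(len(samples))
    return np.sinc(np.asarray(t)[:, None] / T - n[None, :]) @ samples


F = 100.0                                       # sampling rate in Hz (assumed)
T = 1.0 / F
f0 = 13.0                                       # tone frequency, below F/2
n = np.arange(400)
samples = np.cos(2 * np.pi * f0 * n * T)

# Evaluate halfway between sampling instants, away from the edges of the block.
t_mid = (np.arange(150, 250) + 0.5) * T
error = np.max(np.abs(sinc_reconstruct(samples, T, t_mid) - np.cos(2 * np.pi * f0 * t_mid)))
print(error)    # nonzero, because the infinite sum was truncated to 400 terms
```

At the sampling instants themselves the formula returns the samples, since $\mathrm{sinc}(m) = 0$ for every nonzero integer $m$ and $\mathrm{sinc}(0) = 1$.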


While this has been the driving force behind the development of signal acquisition devices, the in- 
creasing complexity of emerging applications demands increasingly higher sampling rates that cannot 
be accommodated by today’s hardware technology. This is the case, for example, in wideband
communications, where conversion speeds, as dictated by Shannon’s bound, have become more and more
difficult to obtain. Consequently, alternatives to high-rate sampling are attracting strong interest, with 
the goal of reducing the sampling rate by exploiting the underlying structure of the signals at hand. For 
example, in many applications, the signal comprises a few frequencies or bands; see Fig. 9.11 for an il- 
lustration. In such cases, sampling at the Nyquist rate is inefficient. This is an old problem investigated 
by a number of authors, leading to techniques that allow low-rate sampling whenever the locations of 
the nonzero bands in the frequency spectrum are known (see, e.g., [61,93,94]). CS theory has inspired 
research to study cases where the locations (carrier frequencies) of the bands are not known a priori. 
A typical application of this kind, of high practical interest, lies within the field of cognitive radio (e.g., 
[68,88,103]). 

FIGURE 9.11

The Fourier transform of an analog signal, $s(t)$, which is sparse in the frequency domain; only a
limited number of frequency bands contribute to its spectrum content $S(\Omega)$, where $\Omega$ stands
for the angular frequency. Nyquist's theory guarantees that sampling at a frequency larger than or equal
to twice the maximum $\Omega_{\max}$ is sufficient to recover the original analog signal. However, this
theory does not exploit information related to the sparse structure of the signal in the frequency
domain.

The process of sampling an analog signal with a rate lower than the Nyquist one is referred to as
analog-to-information sampling or sub-Nyquist sampling. Let us focus on two among the most popular
CS-based A/D converters. The first is the random demodulator (RD), which was first presented in
[58] and later improved and theoretically developed in [91]. RD in its basic configuration is shown in
Fig. 9.10, and it is designed for acquiring at sub-Nyquist rates sparse multitone signals, that is,
signals having a sparse DFT. This implies that the signal comprises a few frequency components, but
these components are constrained to correspond to integral frequencies. This limitation was pointed out
in [91], and potential solutions have been sought according to the general framework proposed in [24]
and/or the heuristic approach described in [45]. Moreover, more elaborate RD designs, such as the
random-modulation preintegrator (RMPI) [102], have the potential to deal with signals that are sparse
in any domain.

Another CS-based sub-Nyquist sampling strategy that has received much attention is the modulated 
wideband converter (MWC) [68,69,71]. The MWC is very efficient in acquiring multiband signals 
such as the one depicted in Fig. 9.11. This concept has also been extended to accommodate signals 
with different characteristics, such as signals consisting of short pulses [66]. An in-depth investigation 
which sheds light on the similarities and differences between the RD and MWC sampling architectures 
can be found in [59]. 

Note that both RD and MWC sample the signal uniformly in time. In [97], a different approach is 
adopted, leading to much easier implementations. In particular, the preprocessing stage is avoided and 
nonuniformly spread in time samples are acquired directly from the raw signal. In total, fewer samples 
are obtained compared to Nyquist sampling. Then CS-based reconstruction is mobilized in order to 
recover the signal under consideration based on the values of the samples and the time information. 
Like in the basic RD case, the nonuniform sampling approach is suitable for signals sparse in the DFT 
basis. From a practical point of view, there are still a number of hardware implementation-related issues
that more or less concern all the approaches above and need to be solved (see, e.g., [9,25,63]). 

An alternative path to sub-Nyquist sampling embraces a different class of analog signals, known 
as multipulse signals, that is, signals that consist of a stream of short pulses. Sparsity now refers to 
the time domain, and such signals may not even be bandlimited. Signals of this type can be met in 
a number of applications, such as radar, ultrasound, bioimaging, and neuronal signal processing (see, 
e.g., [41]). An approach known as finite rate of innovation sampling passes an analog signal having k 
degrees of freedom per second through a linear time-invariant filter, and then samples at a rate of 2k 
samples per second. Reconstruction is performed via rooting a high-order polynomial (see, e.g., [12, 
95] and the references therein). In [66], the task of sub-Nyquist sampling is treated using CS theory 
arguments and an expansion in terms of Gabor functions; the signal is assumed to consist of a sum of 
a few pulses of finite duration, yet of unknown shape and time positions. More on this topic can be 
obtained in [43,51,70] and the references therein. 
FIGURE 9.12

(A) Noiseless case. The values of the true vector, which generated the data for Example 9.5, are shown with stems 
topped with open circles. The recovered points are shown with squares. An exact recovery of the signal has been ob- 
tained. The stems topped with gray-filled circles correspond to the minimum Euclidean norm LS solution. (B) This 
figure corresponds to the noisy counterpart of that in (A). In the presence of noise, exact recovery is not possible and 
the higher the variance of the noise, the less accurate the results. 


Example 9.5. We are given a set of $N = 20$ observations stacked in the $y \in \mathbb{R}^N$ vector.
These were taken by applying a sensing matrix $X$ on an “unknown” vector in $\mathbb{R}^{50}$, which is
known to be sparse with $k = 5$ nonzero components; the location of these nonzero components in the
unknown vector is not known. The sensing matrix was a random matrix with elements drawn from a normal
distribution $\mathcal{N}(0, 1)$, and then the columns were normalized to unit norm. There are two
scenarios for the observations. In the first one, we are given their exact values while in the second
one, white Gaussian noise of variance $\sigma^2 = 0.025$ was added.

In order to recover the unknown sparse vector, the CS matching pursuit (CoSaMP, Chapter 10) 
algorithm was used for both scenarios. 

The results are shown in Fig. 9.12A and B for the noiseless and noisy scenarios, respectively. The
values of the true unknown vector $\theta$ are represented with black stems topped with open circles. Note
that all but five of them are zero. In Fig. 9.12A, exact recovery of the unknown values is achieved;
the estimated values of $\theta_i$, $i = 1, 2, \ldots, 50$, are indicated with squares in red color. In the noisy
case of Fig. 9.12B, the resulting estimates, which are denoted with squares, deviate from the correct values.
Note that estimated values very close to zero ($|\theta| < 0.01$) have been omitted from the figure in order to
facilitate visualization. In both figures, the stemmed gray-filled circles correspond to the minimum $\ell_2$
norm LS solution. The advantages of adopting a sparsity promoting approach to recover the solution
are obvious. The CoSaMP algorithm was provided with the exact sparsity level, $k = 5$. The reader is
advised to reproduce the example, experiment with different values of the parameters, and see how the
results are affected.
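As a starting point for reproducing the example, the following is a minimal sketch of the data generation and of the minimum $\ell_2$ norm LS solution. It is written in Python with NumPy rather than MATLAB; the seed, the variable names, and the choice of drawing the nonzero values from a standard normal distribution are assumptions made only for illustration, since the example does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility (assumption)
N, l, k = 20, 50, 5              # sizes taken from Example 9.5

# Sensing matrix: N(0, 1) entries, then columns normalized to unit norm.
X = rng.standard_normal((N, l))
X /= np.linalg.norm(X, axis=0)

# k-sparse "unknown" vector; random support and Gaussian values are assumptions.
theta = np.zeros(l)
support = rng.choice(l, size=k, replace=False)
theta[support] = rng.standard_normal(k)

y = X @ theta                                           # noiseless observations
y_noisy = y + np.sqrt(0.025) * rng.standard_normal(N)   # noise of variance 0.025

# Minimum l2-norm solution of the underdetermined system y = X theta.
theta_ls = np.linalg.pinv(X) @ y

print("nonzeros in the true vector:", np.count_nonzero(theta))
print("LS entries larger than 1e-6:", int(np.sum(np.abs(theta_ls) > 1e-6)))
print("LS residual norm:", np.linalg.norm(y - X @ theta_ls))
```

The LS solution fits the observations exactly, yet it spreads its energy over many components, which is the behavior the gray-filled circles in Fig. 9.12 illustrate.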


9.10 A CASE STUDY: IMAGE DENOISING 

We have already discussed CS as a notable application of sparsity-aware learning. Although CS has 
acquired a lot of fame, a number of classical signal processing and machine learning tasks lend them- 
selves to efficient modeling via sparsity-related arguments. Two typical examples are the following. 

• Denoising: The problem in signal denoising is that instead of the actual signal samples, $\tilde{y}$, a noisy
version of the corresponding observations, $y$, is available; that is, $y = \tilde{y} + \eta$, where $\eta$ is the vector of
noise samples. Under the sparse modeling framework, the unknown signal $\tilde{y}$ is modeled as a sparse
representation in terms of a specific known dictionary $\Psi$, that is, $\tilde{y} = \Psi\theta$. Moreover, the dictionary
is allowed to be redundant (overcomplete). Then the denoising procedure is realized in two steps.

First, an estimate of the sparse representation vector, $\theta$, is obtained via the $\ell_0$ norm minimizer or
via any LASSO formulation, for example,

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^l} \|\theta\|_1, \qquad (9.40)$$

$$\text{s.t.} \quad \|y - \Psi\theta\|_2^2 \le \epsilon. \qquad (9.41)$$

Second, the estimate of the true signal is computed as $\hat{y} = \Psi\hat{\theta}$. In Chapter 19, we will study the
case where the dictionary is not fixed and known, but is estimated from the data.

• Linear inverse problems: Such problems, which come under the more general umbrella of what is
known as signal restoration, go one step beyond denoising. Now, the available observations are
distorted as well as noisy versions of the true signal samples; that is, $y = H\tilde{y} + \eta$, where $H$ is a
known linear operator. For example, $H$ may correspond to the blurring point spread function of an
image, as discussed in Chapter 4. Then, similar to the denoising example, assuming that the original
signal samples can be efficiently represented in terms of an overcomplete dictionary, $\hat{\theta}$ is estimated,
via any sparsity promoting method, using $H\Psi$ in place of $\Psi$ in (9.41), and the estimate of the true
signal is obtained as $\hat{y} = \Psi\hat{\theta}$.

Besides deblurring, other applications that fall under this formulation include image inpainting,
if $H$ represents the corresponding sampling mask; inverse-Radon transform in tomography, if $H$
comprises the set of parallel projections, and so on. See, for example, [49] for more details on this
topic.

FIGURE 9.13

Denoising based on sparse and redundant representations. (A) Original image. (B) Image + noise
(PSNR = 22). (C) Denoised image (PSNR = 28.2).

In this case study, the image denoising task, based on the sparse and redundant formulation as
discussed above, is explored. Our starting point is the 256 × 256 image shown in Fig. 9.13A. In the
sequel, the image is corrupted by zero mean Gaussian noise, leading to the noisy version of Fig. 9.13B,
corresponding to a peak signal-to-noise ratio (PSNR) equal to 22 dB, which is defined as

$$\text{PSNR} = 20 \log_{10}\left(\frac{m_I}{\sqrt{\text{MSE}}}\right), \qquad (9.42)$$

where $m_I$ is the maximum pixel value of the image and $\text{MSE} = \frac{1}{N_p}\|I - \tilde{I}\|_F^2$, where $I$ and $\tilde{I}$ are the
noisy and original image matrices, $N_p$ is equal to the total number of pixels, and the Frobenius norm
for matrices has been employed.
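The PSNR can be computed in a few lines. The sketch below reads (9.42) as $20\log_{10}(m_I/\sqrt{\text{MSE}})$, with the MSE being the squared Frobenius norm of the difference divided by the number of pixels. The function name `psnr` and the `max_value` argument are hypothetical, and taking $m_I$ as the largest pixel of the original image when no value is supplied is an assumption (for 8-bit images one would often pass 255 instead).

```python
import numpy as np

def psnr(noisy, original, max_value=None):
    """PSNR in dB between a noisy image and the original one, both 2D arrays."""
    noisy = np.asarray(noisy, dtype=float)
    original = np.asarray(original, dtype=float)
    if max_value is None:
        # Assumption: m_I is the largest pixel value of the original image.
        max_value = original.max()
    # Squared Frobenius norm of the difference divided by the number of pixels.
    mse = np.sum((noisy - original) ** 2) / original.size
    return 20.0 * np.log10(max_value / np.sqrt(mse))
```

Note that identical images give a zero MSE, so the ratio is unbounded; a caller that may compare identical images should treat that case separately.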

Denoising could be applied to the full image at once. However, a more efficient practice with respect
to memory consumption is to split the image into patches of size much smaller than that of the image; for
our case, we chose 12 × 12 patches. Then denoising is performed on each patch separately as follows:
The $i$th image patch is reshaped in lexicographic order, forming a one-dimensional vector, $y_i \in \mathbb{R}^{144}$.
We assume that each one of the patches can be reproduced in terms of an overcomplete dictionary, as
discussed before; hence, the denoising task is equivalently formulated around (9.40)–(9.41). Denote
by $\tilde{y}_i$ the $i$th patch of the noise-free image. What is left is to choose a dictionary $\Psi$, which sparsely
represents $\tilde{y}_i$, and then solve for sparse $\theta$ according to (9.40)–(9.41).
FIGURE 9.14

2D-DCT dictionary atoms, corresponding to 12 x 12 patch size. 


It is known that images often exhibit sparse DCT transforms, so an appropriate choice for the
dictionary is to fill the columns of $\Psi$ with atoms of a redundant 2D-DCT reshaped in lexicographic
order [49]. Here, 196 such atoms were used. There is a standard way to develop such a dictionary
given the dimensionality of the image, described in Exercise 9.22. The same dictionary is used for all
patches. The atoms of the dictionary, reshaped to form 12 × 12 blocks, are depicted in Fig. 9.14.

A question that naturally arises is how many patches to use. A straightforward approach is to tile
the patches side by side in order to cover the whole extent of the image. This is feasible; however, it is
likely to result in blocking effects at the edges of several patches. A better practice is to let the patches
overlap. During the reconstruction phase ($\hat{y} = \Psi\hat{\theta}$), because each pixel is covered by more than one
patch, the final value of each pixel is taken as the average of the corresponding predicted values from
all the involved patches. The results of this method, for our case, are shown in Fig. 9.13C. The attained
PSNR is higher than 28 dB.
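The patch extraction and the averaging of overlapping predictions can be sketched as follows. The helper names `extract_patches` and `average_patches` are hypothetical, and the patches are flattened row by row (NumPy's default order), which is an assumption; MATLAB's im2col stacks columns instead. The denoising of each patch is left out, so the sketch only shows the bookkeeping around it.

```python
import numpy as np

def extract_patches(image, p):
    """All sliding p x p patches, each flattened row-wise into a vector."""
    H, W = image.shape
    return np.array([image[r:r + p, c:c + p].ravel()
                     for r in range(H - p + 1)
                     for c in range(W - p + 1)])

def average_patches(patches, shape, p):
    """Put patches back in place and average the overlapping pixel values."""
    H, W = shape
    acc = np.zeros(shape)   # running sum of the predicted values per pixel
    cnt = np.zeros(shape)   # number of patches covering each pixel
    idx = 0
    for r in range(H - p + 1):
        for c in range(W - p + 1):
            acc[r:r + p, c:c + p] += patches[idx].reshape(p, p)
            cnt[r:r + p, c:c + p] += 1
            idx += 1
    return acc / cnt
```

In the case study, each row returned by `extract_patches` would be replaced by its denoised version before calling `average_patches`; with untouched patches the original image is recovered exactly, which is a convenient sanity check.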


PROBLEMS 

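9.1 If $x_i, y_i$, $i = 1, 2, \ldots, l$, are real numbers, then prove the Cauchy-Schwarz inequality,

$$\left(\sum_{i=1}^{l} x_i y_i\right)^2 \le \left(\sum_{i=1}^{l} x_i^2\right)\left(\sum_{i=1}^{l} y_i^2\right).$$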

9.2 Prove that the $\ell_2$ (Euclidean) norm is a true norm, that is, it satisfies the four conditions that
define a norm.

Hint: To prove the triangle inequality, use the Cauchy-Schwarz inequality.

9.3 Prove that any function that is a norm is also a convex function.

9.4 Show Young’s inequality for nonnegative real numbers $a$ and $b$,

$$ab \le \frac{a^p}{p} + \frac{b^q}{q},$$

for $\infty > p > 1$ and $\infty > q > 1$ such that

$$\frac{1}{p} + \frac{1}{q} = 1.$$

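9.5 Prove Hölder’s inequality for $\ell_p$ norms,

$$\sum_{i=1}^{l} |x_i y_i| \le \|x\|_p \|y\|_q = \left(\sum_{i=1}^{l} |x_i|^p\right)^{1/p} \left(\sum_{i=1}^{l} |y_i|^q\right)^{1/q},$$

for $p > 1$ and $q > 1$ such that

$$\frac{1}{p} + \frac{1}{q} = 1.$$

Hint: Use Young’s inequality.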

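9.6 Prove Minkowski’s inequality,

$$\left(\sum_{i=1}^{l} \left(|x_i| + |y_i|\right)^p\right)^{1/p} \le \left(\sum_{i=1}^{l} |x_i|^p\right)^{1/p} + \left(\sum_{i=1}^{l} |y_i|^p\right)^{1/p},$$

for $p \ge 1$.

Hint: Use Hölder’s inequality together with the identity

$$(|a| + |b|)^p = (|a| + |b|)^{p-1}|a| + (|a| + |b|)^{p-1}|b|.$$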

9.7 Prove that for $p \ge 1$, the $\ell_p$ norm is a true norm.

9.8 Use a counterexample to show that any $\ell_p$ norm for $0 < p < 1$ is not a true norm and it violates
the triangle inequality.

9.9 Show that the null space of a full row rank $N \times l$ matrix $X$ is a subspace of dimensionality $l - N$,
for $N < l$.

9.10 Show, using Lagrange multipliers, that the $\ell_2$ minimizer in (9.18) accepts the closed form solution

$$\hat{\theta} = X^T (X X^T)^{-1} y.$$

9.11 Show that the necessary and sufficient condition for a $\theta$ to be a minimizer of

minimize $\|\theta\|_1$, (9.43)

subject to $X\theta = y$ (9.44)

is

$$\sum_{i:\theta_i \ne 0} \operatorname{sign}(\theta_i) z_i \le \sum_{i:\theta_i = 0} |z_i|, \quad \forall z \in \operatorname{null}(X),$$

where $\operatorname{null}(X)$ is the null space of $X$. Moreover, if the minimizer is unique, the previous inequality
becomes a strict one.

9.12 Prove that if the $\ell_1$ norm minimizer is unique, then the number of its components, which are
identically zero, must be at least as large as the dimensionality of the null space of the corresponding
input matrix.

9.13 Show that the $\ell_1$ norm is a convex function (as all norms), yet it is not strictly convex. In contrast,
the squared Euclidean norm is a strictly convex function.

9.14 Construct in the five-dimensional space a matrix that has (a) rank equal to five and spark equal
to four, (b) rank equal to five and spark equal to three, and (c) rank and spark equal to four.

9.15 Let $X$ be a full row rank $N \times l$ matrix, with $l > N$. Derive the Welch bound for the mutual
coherence $\mu(X)$,

$$\mu(X) \ge \sqrt{\frac{l - N}{N(l - 1)}}. \qquad (9.45)$$


9.16 Let $X$ be an $N \times l$ matrix. Then prove that its spark is bounded as

$$\operatorname{spark}(X) \ge 1 + \frac{1}{\mu(X)},$$

where $\mu(X)$ is the mutual coherence of the matrix.

Hint: Consider the Gram matrix $X^T X$ and the following theorem, concerning positive definite
matrices: An $m \times m$ matrix $A$ is positive definite if

$$|A(i,i)| > \sum_{j=1,\, j \ne i}^{m} |A(i,j)|, \quad \forall i = 1, 2, \ldots, m$$

(see, for example, [57]).

9.17 Show that if the underdetermined system of equations $y = X\theta$ accepts a solution such that

$$\|\theta\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right),$$

then the $\ell_1$ minimizer is equivalent to the $\ell_0$ one. Assume that the columns of $X$ are normalized.

9.18 Prove that if the RIP of order $k$ is valid for a matrix $X$ and $\delta_k < 1$, then any $m \le k$ columns of $X$
are necessarily linearly independent.

9.19 Show that if $X$ satisfies the RIP of order $k$ and some isometry constant $\delta_k$, so does the product
$X\Psi$ if $\Psi$ is an orthonormal matrix.
MATLAB® EXERCISES

9.20 Consider an unknown 2-sparse vector $\theta_o$, which when measured with the sensing matrix

$$X = \begin{bmatrix} 0.5 & 2 & 1.5 \\ 2 & 2.3 & 3.5 \end{bmatrix},$$

that is, $y = X\theta_o$, gives $y = [1.25, 3.75]^T$. Perform the following tasks in MATLAB®.
(a) Based on the pseudoinverse of $X$, compute $\hat{\theta}_2$, which is the $\ell_2$ norm minimized solution,
(9.18). Next, check that this solution $\hat{\theta}_2$ leads to zero estimation error (up to machine precision).
Is $\hat{\theta}_2$ a 2-sparse vector such as the true unknown vector $\theta_o$, and if it is not, how is it possible to
lead to zero estimation error? (b) Solve the $\ell_0$ minimization task described in (9.20) (exhaustive
search) for all possible 1- and 2-sparse solutions and get the best one, $\hat{\theta}_o$. Does $\hat{\theta}_o$ lead to zero
estimation error (up to machine precision)? (c) Compute and compare the $\ell_2$ norms of $\hat{\theta}_2$ and
$\hat{\theta}_o$. Which is the smaller one? Was this result expected?

9.21 Generate in MATLAB® a sparse vector $\theta \in \mathbb{R}^l$, $l = 100$, with its first five components taking
random values drawn from a normal distribution $\mathcal{N}(0, 1)$ and the rest being equal to zero. Build,
also, a sensing matrix $X$ with $N = 30$ rows having samples normally distributed, $\mathcal{N}(0, \frac{1}{\sqrt{N}})$, in
order to get 30 observations based on the linear regression model $y = X\theta$. Then perform the
following tasks. (a) Use the function “solvelasso.m”,⁴ or any other LASSO implementation you
prefer, to reconstruct $\theta$ from $y$ and $X$. (b) Repeat the experiment with different realizations of
$X$ in order to compute the probability of correct reconstruction (assume the reconstruction is
exact when $\|y - X\theta\|_2 < 10^{-8}$). (c) Construct another sensing matrix $X$ having $N = 30$ rows
taken uniformly at random from the $l \times l$ DCT matrix, which can be obtained via the built-in
MATLAB® function “dctmtx.m”. Compute the probability of reconstruction when this DCT-based
sensing matrix is used and confirm that results similar to those in question (b) are obtained.
(d) Repeat the same experiment with matrices of the form


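$$X(i,j) = \begin{cases} +\sqrt{\dfrac{\sqrt{p}}{N}}, & \text{with probability } \dfrac{1}{2\sqrt{p}}, \\ 0, & \text{with probability } 1 - \dfrac{1}{\sqrt{p}}, \\ -\sqrt{\dfrac{\sqrt{p}}{N}}, & \text{with probability } \dfrac{1}{2\sqrt{p}}, \end{cases}$$

for $p$ equal to 1, 9, 25, 36, 64 (make sure that each row and each column of $X$ has at least
a nonzero component). Give an explanation why the probability of reconstruction falls as $p$
increases (observe that both the sensing matrix and the unknown vector are sparse).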

9.22 This exercise reproduces the denoising results of the case study in Section 9.10, where the image
depicting the boat can be downloaded from the book website. First, extract from the image all
the possible sliding patches of size 12 × 12 using the im2col.m MATLAB® function. Confirm
that $(256 - 12 + 1)^2 = 60{,}025$ patches in total are obtained. Next, a dictionary in which all the
patches are sparsely represented needs to be designed.

Specifically, the dictionary atoms are going to be those corresponding to the two-dimensional
redundant DCT transform, and are obtained as follows [49]:

a) Consider vectors $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,12}]^T$, $i = 0, \ldots, 13$, being the sampled sinusoids of
the form

$$d_{i,t+1} = \cos\left(\frac{t\pi i}{14}\right), \quad t = 0, \ldots, 11.$$

Then make a (12 × 14) matrix $D$, having as columns the vectors $d_i$ normalized to unit
norm; $D$ resembles a redundant DCT matrix.

b) Construct the ($12^2 \times 14^2$) dictionary $\Psi$ according to $\Psi = D \otimes D$, where $\otimes$ denotes the
Kronecker product. Built in this way, the resulting atoms correspond to atoms related to the
overcomplete 2D-DCT transform [49].

As a next step, denoise each image patch separately. In particular, assuming that $y_i$ is the
$i$th patch reshaped in column vector, use the function “solvelasso.m”,⁴ or any other suitable
algorithm you prefer, to estimate a sparse vector $\theta_i \in \mathbb{R}^{196}$ and obtain the corresponding
denoised vector as $\hat{y}_i = \Psi\hat{\theta}_i$. Finally, average the values of the overlapping patches in order
to form the full denoised image.

⁴ It can be found in the SparseLab MATLAB® toolbox, which is freely available from http://sparselab.stanford.edu/.
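The dictionary construction of steps (a) and (b) can be sketched as below, in Python with NumPy rather than MATLAB. The sketch assumes that the sampled sinusoids have the form $\cos(t\pi i/14)$, $t = 0, \ldots, 11$, $i = 0, \ldots, 13$, which is the usual overcomplete DCT recipe; the function name `redundant_dct_dictionary` and its arguments are hypothetical.

```python
import numpy as np

def redundant_dct_dictionary(n=12, m=14):
    """Overcomplete 2D-DCT dictionary of size (n*n) x (m*m)."""
    t = np.arange(n)[:, None]        # sample index t = 0, ..., n-1
    i = np.arange(m)[None, :]        # atom index   i = 0, ..., m-1
    D = np.cos(t * np.pi * i / m)    # assumed form of the sampled sinusoids
    D /= np.linalg.norm(D, axis=0)   # normalize each column to unit norm
    return np.kron(D, D)             # Kronecker product D (x) D

Psi = redundant_dct_dictionary()
print(Psi.shape)
```

With NumPy's row-wise reshaping, column $14a + b$ of `Psi` reshaped to 12 × 12 equals the outer product of columns $a$ and $b$ of $D$, so the patches should be flattened in the same (row-wise) order before being represented in this dictionary.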
10.1 INTRODUCTION

This chapter is the follow-up to the previous one concerning sparsity-aware learning. The emphasis 
now is on the algorithmic front. Following the theoretical advances concerning sparse modeling, a true 
scientific happening occurred in trying to derive algorithms tailored for the efficient solution of the re- 
lated constrained optimization tasks. Our goal is to present the main directions that have been followed 
and to provide in a more explicit form some of the most popular algorithms. We will discuss batch as 
well as online algorithms. This chapter can also be considered as a complement to Chapter 8, where
some aspects of convex optimization were introduced; a number of algorithms discussed there are also 
appropriate for tasks involving sparsity-related constraints/regularization. 

Besides describing various algorithmic families, some variants of the basic sparsity promoting $\ell_1$
and $\ell_0$ norms are discussed. Furthermore, some typical examples are considered and a case study
concerning time-frequency analysis is presented. Finally, a discussion concerning the issue “synthesis
versus analysis” models is provided.


10.2 SPARSITY PROMOTING ALGORITHMS 

In the previous chapter, our emphasis was on highlighting some of the most important aspects un- 
derlying the theory of sparse signal/parameter vector recovery from an underdetermined set of linear 
equations. We now turn our attention to the algorithmic aspects of the problem (e.g., [52,54]). The 
issue now becomes that of discussing efficient algorithmic schemes, which can achieve the recovery of 
the unknown set of parameters. In Sections 9.3 and 9.5, we saw that the constrained $\ell_1$ norm minimization
(basis pursuit) can be solved via linear programming techniques and the LASSO task via convex
optimization schemes. However, such general purpose techniques tend to be inefficient, because they
often require many iterations to converge, and the respective computational resources can be excessive
for practical applications, especially in high-dimensional spaces $\mathbb{R}^l$. As a consequence, a huge research
effort has been invested for the goal of developing efficient algorithms that are tailored to these specific
tasks. Our aim here is to provide the reader with some general trends and philosophies that characterize
the related activity. We will focus on the most commonly used and cited algorithms, which at the same
time are structurally simple, so the reader can follow them without deeper knowledge of optimization.
Moreover, these algorithms involve, in one way or another, arguments that are directly related to
notions we have already used while presenting the theory; thus, they can also be exploited from a
pedagogical point of view in order to strengthen the reader’s understanding of the topic. We start our review
with the class of batch algorithms, where all data are assumed to be available prior to the application of
the algorithm, and then we will move on to online/time-adaptive schemes. Furthermore, our emphasis
is on algorithms that are appropriate for any sensing matrix. This is stated in order to point out that 
in the literature, efficient algorithms have also been developed for specific forms of highly structured 
sensing matrices, and exploiting their particular structure can lead to reduced computational demands 
[61,93]. 

There are three rough types of families along which this algorithmic activity has grown: (a) greedy
algorithms, (b) iterative shrinkage schemes, and (c) convex optimization techniques. We have used the 
word rough because in some cases, it may be difficult to assign an algorithm to a specific family. 

10.2.1 GREEDY ALGORITHMS 

Greedy algorithms have a long history; see, for example, [114] for a comprehensive list of references.
In the context of dictionary learning, a greedy algorithm known as matching pursuit was introduced
in [88]. A greedy algorithm is built upon a series of locally optimal single-term updates. In our context,
the goals are (a) to unveil the “active” columns of the sensing matrix $X$, that is, those columns
that correspond to the nonzero locations of the unknown parameters, and (b) to estimate the respective
sparse parameter vector. The set of indices that correspond to the nonzero vector components is also
known as the support. To this end, the set of active columns of $X$ (and the support) is increased by
one at each iteration step. In the sequel, an updated estimate of the unknown sparse vector is obtained.
Let us assume that at the $(i-1)$th iteration step, the algorithm has selected the columns denoted
as $x^c_{j_1}, x^c_{j_2}, \ldots, x^c_{j_{i-1}}$, with $j_1, j_2, \ldots, j_{i-1} \in \{1, 2, \ldots, l\}$. These indices are the elements of the
currently available support, $S^{(i-1)}$. Let $X^{(i-1)}$ be the $N \times (i-1)$ matrix, having $x^c_{j_1}, x^c_{j_2}, \ldots, x^c_{j_{i-1}}$ as its
columns. Let also the current estimate of the solution be $\theta^{(i-1)}$, which is an $(i-1)$-sparse vector, with
zeros at all locations with index outside the support. The orthogonal matching pursuit (OMP) scheme
given in Algorithm 10.1 builds up recursively a sparse solution.

Algorithm 10.1 (The OMP algorithm).

The algorithm is initialized with $\theta^{(0)} := 0$, $e^{(0)} := y$, and $S^{(0)} = \emptyset$. At iteration step $i$, the following
computational steps are performed:

1. Select the column $x^c_{j_i}$ of $X$, which is maximally correlated to (forms the smallest angle with) the
respective error vector, $e^{(i-1)} := y - X\theta^{(i-1)}$, that is,

$$x^c_{j_i}: \quad j_i := \arg\max_{j=1,2,\ldots,l} \frac{\left|x_j^{cT} e^{(i-1)}\right|}{\|x^c_j\|_2}.$$

2. Update the support and the corresponding set of active columns, $S^{(i)} = S^{(i-1)} \cup \{j_i\}$ and
$X^{(i)} = [X^{(i-1)}, x^c_{j_i}]$.

3. Update the estimate of the parameter vector: Solve the least-squares (LS) problem that minimizes
the norm of the error, using the active columns of $X$ only, that is,

$$\tilde{\theta} := \arg\min_{z \in \mathbb{R}^i} \|y - X^{(i)} z\|_2^2.$$

Obtain $\theta^{(i)}$ by inserting the elements of $\tilde{\theta}$ in the respective locations $(j_1, j_2, \ldots, j_i)$, which comprise
the support (the rest of the elements of $\theta^{(i)}$ retain their zero values).

4. Update the error vector

$$e^{(i)} := y - X\theta^{(i)}.$$

The algorithm terminates if the norm of the error becomes less than a preselected user-defined
constant, $\epsilon_0$. The following observations are in order.

Remarks 10.1.

• Because $\theta^{(i)}$, in Step 3, is the result of an LS task, we know from Chapter 6 that the error vector is
orthogonal to the subspace spanned by the active columns involved, that is,

$$e^{(i)} \perp \operatorname{span}\{x^c_{j_1}, \ldots, x^c_{j_i}\}.$$

This guarantees that in the next step, taking the correlation of the columns of $X$ with $e^{(i)}$, none of
the previously selected columns will be reselected; they result in zero correlation, being orthogonal
to $e^{(i)}$ (see Fig. 10.1).
FIGURE 10.1

The error vector at the $i$th iteration is orthogonal to the subspace spanned by the currently available set of active
columns. Here is an illustration for the case of the three-dimensional Euclidean space $\mathbb{R}^3$, and for $i = 2$.


• The column which has maximal correlation (maximum absolute value of the inner product) with the
currently available error vector is the one that maximally reduces (compared to any other column)
the $\ell_2$ norm of the error, when $y$ is approximated by linearly combining the currently available
active columns. This is the point where the heart of the greedy strategy beats. This minimization is
with respect to a single term, keeping the rest fixed, as they have been obtained from the previous
iteration steps (Problem 10.1).

• Starting with all the components being zero, if the algorithm stops after $k_0$ iteration steps, the result
will be a $k_0$-sparse solution.

• Note that there is no optimality in this searching strategy. The only guarantee is that the $\ell_2$ norm
of the error vector is decreased at every iteration step. In general, there is no guarantee that the
algorithm can obtain a solution close to the true one (see, for example, [38]). However, under certain
constraints on the structure of $X$, performance bounds can be obtained (see, for example, [37,115,
123]).

• The complexity of the algorithm amounts to $O(k_0 l N)$ operations, which are contributed by the
computations of the correlations, plus the demands raised by the solution of the LS task in step 3, whose
complexity depends on the specific algorithm used. The $k_0$ is the sparsity level of the delivered
solution and, hence, the total number of iteration steps that are performed.
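The four steps of Algorithm 10.1 translate almost line by line into code. The following is a minimal sketch in Python with NumPy, not an optimized implementation; the function name `omp`, the arguments `eps0` and `max_iter`, and the cap of at most $N$ iterations are assumptions introduced for illustration. The LS task of step 3 is solved from scratch with `np.linalg.lstsq` at every iteration.

```python
import numpy as np

def omp(X, y, eps0=1e-10, max_iter=None):
    """Sketch of orthogonal matching pursuit; returns (theta, support)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, l = X.shape
    max_iter = N if max_iter is None else max_iter   # assumed iteration cap
    col_norms = np.linalg.norm(X, axis=0)
    theta = np.zeros(l)
    error = y.copy()
    support = []
    for _ in range(max_iter):
        if np.linalg.norm(error) < eps0:   # halting criterion
            break
        # Step 1: column maximally correlated with the current error vector.
        corr = np.abs(X.T @ error) / col_norms
        if support:
            corr[support] = 0.0            # guard against round-off reselection
        j = int(np.argmax(corr))
        # Step 2: update the support (and, implicitly, the active columns).
        support.append(j)
        # Step 3: LS fit using the active columns only.
        z, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        theta = np.zeros(l)
        theta[support] = z
        # Step 4: update the error vector.
        error = y - X @ theta
    return theta, support
```

For instance, with `X = np.eye(5)` and `y` having two nonzero entries, the sketch selects the two corresponding columns, largest magnitude first, and stops with zero error after two iterations.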

Another more qualitative argument that justifies the selection of the columns based on their correlation
with the error vector is the following. Assume that the matrix $X$ is orthonormal. Let $y = X\theta$.
Then $y$ lies in the subspace spanned by the active columns of $X$, that is, those that correspond to the
nonzero components of $\theta$. Hence, the rest of the columns are orthogonal to $y$, because $X$ is assumed to
be orthonormal. Taking the correlation of $y$, at the first iteration step, with all the columns, it is certain
that one among the active columns will be chosen. The inactive columns result in zero correlation.
A similar argument holds true for all subsequent steps, because all activity takes place in a subspace
that is orthogonal to all the inactive columns of $X$. In the more general case, where $X$ is not orthonormal,
we can still use the correlation as a measure that quantifies geometric similarity. The smaller the
correlation/magnitude of the inner product is, the more orthogonal the two vectors are. This brings
us back to the notion of mutual coherence, which is a measure of the maximum correlation (smallest
angle) among the columns of $X$.
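The orthonormal case can be written out in one line. In the sketch below, $x^c_j$ denotes the $j$th column of $X$ and $S$ the support of $\theta$; the symbol $S$ is introduced here only for this illustration.

```latex
x_j^{cT} y \;=\; \sum_{i \in S} \theta_i \, x_j^{cT} x_i^c \;=\;
\begin{cases}
  \theta_j, & j \in S, \\
  0,        & j \notin S,
\end{cases}
\qquad \text{since } x_j^{cT} x_i^c = \delta_{ij}.
```

Hence, for orthonormal columns the correlations reproduce the nonzero components of $\theta$ exactly, and the first selected column is the active one with the largest $|\theta_j|$.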
OMP Can Recover Optimal Sparse Solutions: Sufficiency Condition

We have already stated that, in general, there are no guarantees that OMP will recover optimal solutions.
However, when the unknown vector is sufficiently sparse, with respect to the structure of the
sensing matrix $X$, OMP can exactly solve the $\ell_0$ minimization task in (9.20) and recover the solution
in $k_0$ steps, where $k_0$ is the number of nonzero components of the sparsest solution that satisfies the
associated linear set of equations.

Theorem 10.1. Let the mutual coherence (Section 9.6.1) of the sensing matrix $X$ be $\mu(X)$. Assume,
also, that the linear system $y = X\theta$ accepts a solution such that

$$\|\theta\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right). \qquad (10.1)$$

Then OMP guarantees recovery of the sparsest solution in $k_0 = \|\theta\|_0$ steps.

We know from Section 9.6.1 that under the previous condition, any other solution will be necessarily
less sparse. Hence, there is a unique way to represent $y$ in terms of $k_0$ columns of $X$. Without loss of
generality, let us assume that the true support corresponds to the first $k_0$ columns of $X$, that is,

$$y = \sum_{j=1}^{k_0} \theta_j x^c_j, \quad \theta_j \ne 0, \ \forall j \in \{1, \ldots, k_0\}.$$

The theorem is a direct consequence of the following proposition. 

Proposition 10.1. If the condition (10.1) holds true, then the OMP algorithm will never select a column
with index outside the true support (see, for example, [115] and Problem 10.2). In a more formal way,
this is expressed as

$$j_i = \arg\max_{j=1,2,\ldots,l} \frac{\left|x_j^{cT} e^{(i-1)}\right|}{\|x^c_j\|_2} \in \{1, \ldots, k_0\}.$$

A geometric interpretation of this proposition is the following. If the angles formed between all the
possible pairs among the columns of $X$ are close to 90° (columns almost orthogonal) in the $\mathbb{R}^N$ space,
which guarantees that $\mu(X)$ is small enough, then $y$ will lean more (form a smaller angle) toward any
one of the active columns that contribute to its formation, compared to the rest that are inactive and
do not participate in the linear combination that generates $y$. Fig. 10.2 illustrates the geometry, for
the extreme case of mutually orthogonal vectors (Fig. 10.2A), and for the more general case where
the vectors are not orthogonal, yet the angle between any pair of columns is close enough to 90°
(Fig. 10.2B).

In a nutshell, the previous proposition guarantees that during the first iteration, a column corresponding
to the true support will be selected. In a similar way, this is also true for all subsequent
iterations. In the second step, another column, different from the previously selected one (as has already
been stated) will be chosen. At step $k_0$, the last remaining active column corresponding to the
true support is selected, and this necessarily results in zero error. To this end, it suffices to set $\epsilon_0$ equal
to zero.
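Whether a given sensing matrix falls under the sufficient condition (10.1) can be checked numerically. The sketch below computes the mutual coherence as the largest absolute normalized inner product between distinct columns, which is the standard definition and is assumed to agree with the one in Section 9.6.1, and then evaluates the right-hand side of (10.1). The function names `mutual_coherence` and `omp_sparsity_bound` are hypothetical.

```python
import numpy as np

def mutual_coherence(X):
    """Largest absolute correlation between two distinct normalized columns."""
    X = np.asarray(X, dtype=float)
    Xn = X / np.linalg.norm(X, axis=0)   # unit-norm columns
    G = np.abs(Xn.T @ Xn)                # magnitudes of the Gram matrix entries
    np.fill_diagonal(G, 0.0)             # ignore each column's self-correlation
    return G.max()

def omp_sparsity_bound(X):
    """Right-hand side of (10.1): (1 + 1/mu(X)) / 2."""
    return 0.5 * (1.0 + 1.0 / mutual_coherence(X))
```

Any solution whose number of nonzero components is strictly below `omp_sparsity_bound(X)` satisfies (10.1). For a matrix with exactly orthogonal columns the coherence is zero and the bound is unbounded, so that case needs separate handling.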
FIGURE 10.2

(A) In the case of an orthogonal matrix, the observations vector y will be orthogonal to any inactive column, here, 
Xj. (B) In the more general case, it is expected to “lean” closer (form smaller angles) to the active than to the 
inactive columns. 


The LARS Algorithm 

The least-angle regression (LARS) algorithm [48] shares the first two steps with OMP. It selects $j_i$
to be an index outside the currently available active set in order to maximize the correlation with the
residual vector. However, instead of performing an LS fit to compute the nonzero components of $\theta^{(i)}$,
these are computed so that the residual will be equicorrelated with all the columns in the active set, that
is,

$$\left|x_j^{cT}\left(y - X\theta^{(i)}\right)\right| = \text{constant}, \quad \forall j \in S^{(i)},$$

where we have assumed that the columns of $X$ are normalized, as is common in practice (recall, also,
Remarks 9.4). In other words, in contrast to OMP, where the error vector is forced to be orthogonal
to the active columns, LARS demands this error to form equal angles with each one of them. Like
OMP, it can be shown that, provided the target vector is sufficiently sparse and under incoherence of
the columns of $X$, LARS can exactly recover the sparsest solution [116].

A further small modification leads to the LARS-LASSO algorithm. According to this version, a previously
selected index in the active set can be removed at a later stage. This gives the algorithm the
potential to “recover” from a previously bad decision. Hence, this modification departs from the strict
rationale that defines the greedy algorithms. It turns out that this version solves the LASSO optimization
task. This algorithm is the same as the one suggested in [99] and it is known as a homotopy
algorithm. Homotopy methods are based on a continuous transformation from one optimization task
to another. The solutions to this sequence of tasks lie along a continuous parameterized path. The idea
is that while the optimization tasks may be difficult to solve by themselves, one can trace this path of
solutions by slowly varying the parameters. For the LASSO task, it is the $\lambda$ parameter that is varying
(see, for example, [4,86,104]). Take as an example the LASSO task in its regularized version in (9.6).
For $\lambda = 0$, the task minimizes the $\ell_2$ error norm and for $\lambda \to \infty$ the task minimizes the parameter
vector’s $\ell_1$ norm, and for this case the solution tends to zero. It turns out that the solution path, as $\lambda$
changes from large to small values, is polygonal. Vertices on this solution path correspond to vectors
having nonzero elements only on a subset of entries. This subset remains unchanged until $\lambda$ reaches
the next critical value, which corresponds to a new vertex of the polygonal path and to a new subset of
potential nonzero values. Thus, the solution is obtained via this sequence of steps along this polygonal
path.

Compressed Sensing Matching Pursuit (CSMP) Algorithms 

Strictly speaking, the algorithms to be discussed here are not greedy, yet as stated in [93], they are at 
heart greedy algorithms. Instead of performing a single term optimization per iteration step, in order to 
increase the support by one, as is the case with OMP, these algorithms attempt to obtain first an estimate 
of the support and then use this information to compute an LS estimate of the target vector, constrained 
on the respective active columns. The quintessence of the method lies in the near-orthogonal nature of 
the sensing matrix, assuming that this obeys the RIP condition. 

Assume that $X$ obeys the RIP for some small enough value $\delta_k$ and sparsity level $k$ of the unknown
vector. Let, also, the measurements be exact, that is, $y = X\theta$. Then $X^T y = X^T X\theta \approx \theta$, due to the
near-orthogonal nature of $X$. Therefore, intuition indicates that it is not unreasonable to select, in the first
iteration step, the $t$ (a user-defined parameter) largest in magnitude components of $X^T y$ as indicative
of the nonzero positions of the sparse target vector. This reasoning carries on for all subsequent steps,
where, at the $i$th iteration, the place of $y$ is taken by the residual $e^{(i-1)} := y - X\theta^{(i-1)}$, where $\theta^{(i-1)}$
indicates the estimate of the target vector at the $(i-1)$th iteration. Basically, this could be considered
as a generalization of the OMP. However, as we will soon see, the difference between the two
mechanisms is more substantial.


Algorithm 10.2 (The CSMP scheme).

1. Select the value of $t$.

2. Initialize the algorithm: $\theta^{(0)} = 0$, $e^{(0)} = y$.

3. For $i = 1, 2, \ldots$, execute the following:

(a) Obtain the current support:

$$S^{(i)} := \operatorname{supp}\left(\theta^{(i-1)}\right) \cup \left\{ \text{indices of the } t \text{ largest in magnitude components of } X^T e^{(i-1)} \right\}.$$

(b) Select the active columns: Construct $X^{(i)}$ to comprise the active columns of $X$ in accordance
to $S^{(i)}$. Obviously, $X^{(i)}$ is an $N \times r$ matrix, where $r$ denotes the cardinality of the support set
$S^{(i)}$.

(c) Update the estimate of the parameter vector: solve the LS task

$$\tilde{\theta} := \arg\min_{z \in \mathbb{R}^r} \|y - X^{(i)} z\|_2^2.$$

Obtain $\hat{\theta}^{(i)} \in \mathbb{R}^l$ having the $r$ elements of $\tilde{\theta}$ in the respective locations, as indicated by the
support, and the rest of the elements being zero.

(d) $\theta^{(i)} := H_k\left(\hat{\theta}^{(i)}\right)$. The mapping $H_k$ denotes the hard thresholding function; that is, it returns a
vector with the $k$ largest in magnitude components of the argument, and the rest are forced to
zero.

(e) Update the error vector: $e^{(i)} = y - X\theta^{(i)}$.
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The algorithm requires as input the sparsity level k. Iterations carry on until a halting criterion is 
met. The value of t, which determines the largest in magnitude values in steps 1 and 3a, depends on 
the specific algorithm. In the compressive sampling matching pursuit (CoSaMP) [93], t = 2k (Prob- 
lem 10.3), and in the subspace pursuit (SP) [33], t — k. 
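The steps of the CSMP scheme can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation of [93] or [33]: the function names `hard_threshold` and `csmp`, the halting rule (a relative residual tolerance `tol` plus an iteration cap `n_iter`), and the toy problem sizes are all choices made here for the example.

```python
import numpy as np

def hard_threshold(v, k):
    """H_k: keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def csmp(X, y, k, t, n_iter=50, tol=1e-10):
    """Sketch of the CSMP scheme; t = 2k mimics CoSaMP, t = k mimics SP.

    The halting rule (relative residual below tol, or n_iter reached)
    is an assumption of this sketch.
    """
    l = X.shape[1]
    theta = np.zeros(l)
    e = y.copy()
    for _ in range(n_iter):
        # (a) current support: previous support plus the t strongest correlations
        corr = X.T @ e
        new_idx = np.argsort(np.abs(corr))[-t:]
        S = np.union1d(np.flatnonzero(theta), new_idx)
        # (b)-(c) least squares restricted to the active columns
        z, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        theta_hat = np.zeros(l)
        theta_hat[S] = z
        # (d) prune back to the k largest components
        theta = hard_threshold(theta_hat, k)
        # (e) update the error vector
        e = y - X @ theta
        if np.linalg.norm(e) <= tol * max(np.linalg.norm(y), 1.0):
            break
    return theta

# Toy experiment: exact measurements, Gaussian sensing matrix, +/-1 nonzeros.
rng = np.random.default_rng(0)
N, l, k = 64, 128, 4
X = rng.standard_normal((N, l)) / np.sqrt(N)
theta_true = np.zeros(l)
theta_true[rng.choice(l, size=k, replace=False)] = rng.choice([-1.0, 1.0], size=k)
y = X @ theta_true

theta_cosamp = csmp(X, y, k, t=2 * k)
theta_sp = csmp(X, y, k, t=k)
err_cosamp = np.linalg.norm(theta_cosamp - theta_true)
err_sp = np.linalg.norm(theta_sp - theta_true)
print(err_cosamp, err_sp)
```

Note that the support is rebuilt in every pass, so a column selected early on can later be dropped by the pruning step (d).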

Having stated the general scheme, a major difference with OMP becomes readily apparent. In
OMP, only one column is selected per iteration step. Moreover, this remains in the active set for all
subsequent steps. If, for some reason, this was not a good choice, the scheme cannot recover from
such a bad decision. In contrast, the support and, hence, the active columns of $X$ are continuously
updated in CSMP, and the algorithm has the ability to correct a previously bad decision, as more
information is accumulated and iterations progress. In [33], it is shown that if the measurements are
exact ($y = X\theta$), then SP can recover the $k$-sparse true vector in a finite number of iteration steps,
provided that $X$ satisfies the RIP with $\delta_{3k} < 0.205$. If the measurements are noisy, performance bounds
have been derived, which hold true for $\delta_{3k} < 0.083$. For the CoSaMP, performance bounds have been
derived for $\delta_{4k} < 0.1$.

10.2.2 ITERATIVE SHRINKAGE/THRESHOLDING (IST) ALGORITHMS

This family of algorithms also has a long history (see, for example, [44,69,70,73]). However, in the
“early” days, most of the developed algorithms had a heuristic flavor, without establishing
a clear bridge with optimizing a cost function. Later attempts were substantiated by sound theoretical
arguments concerning issues such as convergence and convergence rate [31,34,50,56].

The general form of this algorithmic family has a striking resemblance to the classical linear
algebra iterative schemes for approximating the solution of large linear systems of equations, known as
stationary iterative or iterative relaxation methods. The classical Gauss-Seidel and Jacobi algorithms
(e.g., [65]) in numerical analysis can be considered members of this family. Given a linear system of
$l$ equations with $l$ unknowns, $z = Ax$, the basic iteration at step $i$ has the following form:

$$x^{(i)} = (I - QA)x^{(i-1)} + Qz = x^{(i-1)} + Qe^{(i-1)}, \quad e^{(i-1)} := z - Ax^{(i-1)},$$

which does not come as a surprise. It is of the same form as most of the iterative schemes for numerical
solutions! The matrix $Q$ is chosen in order to guarantee convergence, and different choices lead to
different algorithms with their pros and cons. It turns out that this algorithmic form can also be applied
to underdetermined systems of equations, $y = X\theta$, with a “minor” modification, which is imposed
by the sparsity constraint of the target vector. This leads to the following general form of iterative
computation:

$$\theta^{(i)} = T_i\big(\theta^{(i-1)} + Qe^{(i-1)}\big), \quad e^{(i-1)} = y - X\theta^{(i-1)},$$

starting from an initial guess of $\theta^{(0)}$ (usually $\theta^{(0)} = 0$, $e^{(0)} = y$). In certain cases, $Q$ can be made to
be iteration dependent. The function $T_i$ is a nonlinear thresholding that is applied entry-wise, that is,
component-wise. Depending on the specific scheme, this can be either the hard thresholding function,
denoted as $H_k$, or the soft thresholding function, denoted as $S_\alpha$. Hard thresholding, as we already know,
keeps the $k$ largest components of a vector unaltered and sets the rest equal to zero. Soft thresholding
was introduced in Section 9.3. All components with magnitude less than a threshold value $\alpha$ are forced
to zero and the rest are reduced in magnitude by $\alpha$; that is, the $j$th component of a vector $\theta$, after soft
thresholding, becomes

$$\big(S_\alpha(\theta)\big)_j = \operatorname{sgn}(\theta_j)\big(|\theta_j| - \alpha\big)_+.$$

Depending on (a) the choice of $T_i$, (b) the specific value of the parameter $k$ or $\alpha$, and (c) the matrix $Q$,
different instances occur. The most common choice for $Q$ is $\mu X^T$, and the generic form of the main
iteration becomes

$$\theta^{(i)} = T_i\big(\theta^{(i-1)} + \mu X^T e^{(i-1)}\big), \tag{10.2}$$

where $\mu$ is a relaxation (user-defined) parameter, which can also be left to vary with each iteration step.
The choice of $X^T$ is intuitively justified, once more, by the near-orthogonal nature of $X$. For the first
iteration step and for a linear system of the form $y = X\theta$, starting from a zero initial guess, we have
$X^T y = X^T X\theta \approx \theta$ and we are close to the solution.

Although intuition is most important in scientific research, it is not enough, by itself, to justify
decisions and actions. The generic scheme in (10.2) has been reached from different paths, following
different perspectives that lead to different choices of the involved parameters. Let us spend some more
time on that, with the aim of making the reader more familiar with techniques that address optimization
tasks of nondifferentiable loss functions. The term in parentheses in (10.2) coincides with the gradient
descent iteration step if the cost function is the unregularized sum of squared errors cost (LS), that is,

$$J(\theta) = \frac{1}{2}\|y - X\theta\|_2^2.$$

In this case, the gradient descent rationale leads to

$$\theta^{(i-1)} - \mu\frac{\partial J(\theta^{(i-1)})}{\partial\theta} = \theta^{(i-1)} - \mu X^T\big(X\theta^{(i-1)} - y\big) = \theta^{(i-1)} + \mu X^T e^{(i-1)}.$$



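The gradient descent can alternatively be viewed as the result of minimizing a regularized version of
the linearized cost function (verify it),

$$\theta^{(i)} = \arg\min_{\theta\in\mathbb{R}^l}\left\{ J(\theta^{(i-1)}) + \big(\theta - \theta^{(i-1)}\big)^T\frac{\partial J(\theta^{(i-1)})}{\partial\theta} + \frac{1}{2\mu}\big\|\theta - \theta^{(i-1)}\big\|_2^2 \right\}. \tag{10.3}$$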
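The verification of (10.3) takes one line. In the sketch below, $g$ is a shorthand introduced here (not part of the original notation) for the gradient $\partial J(\theta^{(i-1)})/\partial\theta$; setting the gradient of the bracketed quadratic to zero recovers the gradient descent step.

```latex
\begin{align*}
\frac{\partial}{\partial\theta}\Big[\, J(\theta^{(i-1)}) + (\theta-\theta^{(i-1)})^T g
      + \frac{1}{2\mu}\,\|\theta-\theta^{(i-1)}\|_2^2 \Big]
  &= g + \frac{1}{\mu}\big(\theta-\theta^{(i-1)}\big) = 0 \\
\Longrightarrow\qquad \theta^{(i)} &= \theta^{(i-1)} - \mu\, g .
\end{align*}
```

The quadratic term keeps the new estimate close to the previous one, and $1/\mu$ controls how strongly this proximity is enforced.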


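One can adopt this view of the gradient descent philosophy as a kick-off point to minimize iteratively
the following LASSO task:

$$\min_{\theta\in\mathbb{R}^l}\left\{ L(\theta,\lambda) := \frac{1}{2}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1 = J(\theta) + \lambda\|\theta\|_1 \right\}.$$

The difference now is that the loss function comprises two terms: one that is smooth (differentiable)
and a nonsmooth one. Let the current estimate be $\theta^{(i-1)}$. The updated estimate is obtained by

$$\theta^{(i)} = \arg\min_{\theta\in\mathbb{R}^l}\left\{ J(\theta^{(i-1)}) + \big(\theta - \theta^{(i-1)}\big)^T\frac{\partial J(\theta^{(i-1)})}{\partial\theta} + \frac{1}{2\mu}\big\|\theta - \theta^{(i-1)}\big\|_2^2 + \lambda\|\theta\|_1 \right\},$$

which, after ignoring constants, is equivalently written as

$$\theta^{(i)} = \arg\min_{\theta\in\mathbb{R}^l}\left\{ \frac{1}{2}\big\|\theta - \tilde{\theta}\big\|_2^2 + \lambda\mu\|\theta\|_1 \right\}, \tag{10.4}$$

where

$$\tilde{\theta} := \theta^{(i-1)} - \mu\frac{\partial J(\theta^{(i-1)})}{\partial\theta}. \tag{10.5}$$

Following exactly the same steps as those that led to the derivation of (9.13) from (9.6) (after replacing
$\hat{\theta}_{LS}$ with $\tilde{\theta}$), we obtain

$$\theta^{(i)} = S_{\lambda\mu}\big(\tilde{\theta}\big) = S_{\lambda\mu}\left(\theta^{(i-1)} - \mu\frac{\partial J(\theta^{(i-1)})}{\partial\theta}\right) \tag{10.6}$$

$$\phantom{\theta^{(i)}} = S_{\lambda\mu}\big(\theta^{(i-1)} + \mu X^T e^{(i-1)}\big). \tag{10.7}$$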

This is very interesting and practically useful. The only effect of the presence of the nonsmooth $\ell_1$
norm in the loss function is an extra simple thresholding operation, which as we know is an operation
performed individually on each component. It can be shown (e.g., [11,95]) that this algorithm converges
to a minimizer $\theta_*$ of the LASSO (9.6), provided that $\mu \in (0, 1/\lambda_{\max}(X^TX))$, where $\lambda_{\max}(\cdot)$ denotes
the maximum eigenvalue of $X^TX$. The convergence rate is dictated by the rule

$$L(\theta^{(i)}, \lambda) - L(\theta_*, \lambda) \approx O(1/i),$$

which is known as sublinear global rate of convergence. Moreover, it can be shown that

$$L(\theta^{(i)}, \lambda) - L(\theta_*, \lambda) \le \frac{C\,\|\theta^{(0)} - \theta_*\|_2^2}{2i}.$$

The latter result indicates that if one wants to achieve an accuracy of $\epsilon$, then this can be obtained by at
most $\left\lfloor C\,\|\theta^{(0)} - \theta_*\|_2^2 / (2\epsilon) \right\rfloor$ iterations, where $\lfloor\cdot\rfloor$ denotes the floor function.
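The soft thresholding iteration (10.7) is short enough to be written out directly. The following is a minimal sketch, assuming a fixed step $\mu = 0.99/\lambda_{\max}(X^TX)$ and a fixed iteration count; the names `soft_threshold`, `lasso_cost`, and `ist`, as well as the toy data and the value of `lam`, are illustrative choices made here.

```python
import numpy as np

def soft_threshold(v, alpha):
    """S_alpha, component-wise: sgn(v_j) * max(|v_j| - alpha, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def lasso_cost(X, y, theta, lam):
    """L(theta, lambda) = 0.5 * ||y - X theta||_2^2 + lambda * ||theta||_1."""
    return 0.5 * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

def ist(X, y, lam, n_iter=500):
    """Iteration (10.7) with a constant step mu inside (0, 1/lambda_max(X^T X))."""
    # The squared spectral norm of X equals the largest eigenvalue of X^T X.
    mu = 0.99 / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(X.shape[1])
    costs = [lasso_cost(X, y, theta, lam)]
    for _ in range(n_iter):
        e = y - X @ theta                                  # error vector e^(i-1)
        theta = soft_threshold(theta + mu * (X.T @ e), lam * mu)
        costs.append(lasso_cost(X, y, theta, lam))
    return theta, np.array(costs)

# Toy data (assumed for illustration): Gaussian X, 5-sparse target, exact measurements.
rng = np.random.default_rng(1)
N, l, k = 50, 100, 5
X = rng.standard_normal((N, l)) / np.sqrt(N)
theta_true = np.zeros(l)
theta_true[rng.choice(l, size=k, replace=False)] = 1.0
y = X @ theta_true

theta_ist, costs = ist(X, y, lam=0.05)
print(costs[0], costs[-1])
```

With the step restricted as above, the recorded LASSO cost does not increase from one iteration to the next, which makes the `costs` array a convenient diagnostic when experimenting with $\mu$ and $\lambda$.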

In [34], (10.2) was obtained from a nearby corner, building upon arguments from the classical
proximal point methods in optimization theory (e.g., [105]). The original LASSO regularized cost
function is modified to the surrogate objective,

$$J(\theta, \tilde{\theta}) = \frac{1}{2}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1 + \frac{1}{2}d(\theta, \tilde{\theta}),$$

where

$$d(\theta, \tilde{\theta}) := c\,\|\theta - \tilde{\theta}\|_2^2 - \|X\theta - X\tilde{\theta}\|_2^2.$$

If $c$ is appropriately chosen (larger than the largest eigenvalue of $X^TX$), the surrogate objective is
guaranteed to be strictly convex. Then it can be shown (Problem 10.4) that the minimizer of the surrogate
objective is given by

$$\hat{\theta} = S_{\lambda/c}\left(\tilde{\theta} + \frac{1}{c}X^T\big(y - X\tilde{\theta}\big)\right). \tag{10.8}$$

In the iterative formulation, $\tilde{\theta}$ is selected to be the previously obtained estimate; in this way, one
tries to keep the new estimate close to the previous one. The procedure readily results in our generic
scheme in (10.2), using soft thresholding with parameter $\lambda/c$. It can be shown that such a strategy
converges to a minimizer of the original LASSO problem. The same algorithm was reached in [56],
using majorization-minimization techniques from optimization theory. So, from this perspective, the
IST family has strong ties with algorithms that belong to the convex optimization category.
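A sketch of why the minimizer in (10.8) takes the soft thresholding form (the full argument is the subject of Problem 10.4). It is assumed here that the surrogate objective is the LASSO cost plus $\frac{1}{2}d(\theta,\tilde{\theta})$; under that assumption the terms that are quadratic in $X\theta$ cancel and the smooth part becomes an isotropic quadratic:

```latex
\begin{align*}
\tfrac{1}{2}\|y - X\theta\|_2^2 + \tfrac{1}{2}\,d(\theta,\tilde{\theta})
  &= \tfrac{c}{2}\|\theta\|_2^2
     - \theta^T\big(c\,\tilde{\theta} + X^T(y - X\tilde{\theta})\big) + \text{const} \\
  &= \tfrac{c}{2}\Big\|\theta - \Big(\tilde{\theta}
     + \tfrac{1}{c}X^T(y - X\tilde{\theta})\Big)\Big\|_2^2 + \text{const}.
\end{align*}
```

Adding $\lambda\|\theta\|_1$ and dividing by $c$ gives a task of the same form as (10.4), with threshold $\lambda/c$, whose solution is the component-wise soft thresholding of the bracketed vector.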

In [118], the sparse reconstruction by separable approximation (SpaRSA) algorithm is proposed,
which is a modification of the standard IST scheme. The starting point is (10.3); however, the
multiplying factor, $\frac{1}{2\mu}$, instead of being constant, is now allowed to change from iteration to iteration according
to a rule. This results in a speedup in the convergence of the algorithm. Moreover, inspired by the
homotopy family of algorithms, where $\lambda$ is allowed to vary, SpaRSA can be extended to solve a sequence
of problems that are associated with a corresponding sequence of values of $\lambda$. Once a solution has been
obtained for a particular value of $\lambda$, it can be used as a “warm start” for a nearby value. Solutions can
therefore be computed for a range of values, at a small extra computational cost, compared to solving
for a single value from a “cold start.” This technique abides by the continuation strategy, which has
been used in the context of other algorithms as well (e.g., [66]). Continuation has been shown to be a
very successful tool to increase the speed of convergence.

An interesting variation of the basic IST scheme has been proposed in [11], which improves the
convergence rate to $O(1/i^2)$, by only a simple modification with almost no extra computational burden.
The scheme is known as fast iterative shrinkage-thresholding algorithm (FISTA). This scheme is an
evolution of [96], which introduced the basic idea for the case of differentiable costs, and consists of
the following steps:

$$\theta^{(i)} = S_{\lambda\mu}\Big(z^{(i)} + \mu X^T\big(y - Xz^{(i)}\big)\Big),$$

$$z^{(i+1)} := \theta^{(i)} + \frac{t_i - 1}{t_{i+1}}\big(\theta^{(i)} - \theta^{(i-1)}\big),$$

where

$$t_{i+1} := \frac{1 + \sqrt{1 + 4t_i^2}}{2},$$

with initial points $t_1 = 1$ and $z^{(1)} = \theta^{(0)}$. In words, in the thresholding operation, $\theta^{(i-1)}$ is replaced
by $z^{(i)}$, which is a specific linear combination of two successive updates of $\theta$. Hence, at a marginal
increase of the computational cost, a substantial increase in convergence speed is achieved.

In [17] the hard thresholding version has been used, with $\mu = 1$, and the thresholding function $H_k$
uses the sparsity level $k$ of the target solution, which is assumed to be known. In a later version, [19],
the relaxation parameter is left to change so that, at each iteration step, the error is maximally reduced.
It has been shown that the algorithm converges to a local minimum of the cost function $\|y - X\theta\|_2^2$,
under the constraint that $\theta$ is a $k$-sparse vector. Moreover, the latter version is a stable one and it results
in a near-optimal solution if a form of RIP is fulfilled.
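The two variants just described, FISTA and iterative hard thresholding with $\mu = 1$, can be sketched side by side. This is an illustrative sketch, not the code of [11] or [17]: the function names, the step $\mu = 0.99/\lambda_{\max}(X^TX)$ used for FISTA, and the toy data are assumptions. For the hard thresholding run, the system is rescaled so that the spectral norm of the sensing matrix is below one; this rescaling is a precaution taken in this example so that the step $\mu = 1$ does not increase the squared error.

```python
import numpy as np

def soft_threshold(v, alpha):
    """S_alpha, component-wise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def hard_threshold(v, k):
    """H_k: keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def lasso_cost(X, y, theta, lam):
    return 0.5 * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

def fista(X, y, lam, n_iter=300):
    """FISTA recursion: soft thresholding applied at the extrapolated point z."""
    mu = 0.99 / np.linalg.norm(X, 2) ** 2
    theta_prev = np.zeros(X.shape[1])    # theta^(0)
    z = theta_prev.copy()                # z^(1) = theta^(0)
    t = 1.0                              # t_1 = 1
    for _ in range(n_iter):
        theta = soft_threshold(z + mu * (X.T @ (y - X @ z)), lam * mu)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = theta + ((t - 1.0) / t_next) * (theta - theta_prev)
        theta_prev, t = theta, t_next
    return theta_prev

def iht(X, y, k, mu=1.0, n_iter=300):
    """Generic iteration (10.2) with T_i = H_k; also records ||y - X theta||^2."""
    theta = np.zeros(X.shape[1])
    costs = [np.sum(y ** 2)]
    for _ in range(n_iter):
        theta = hard_threshold(theta + mu * (X.T @ (y - X @ theta)), k)
        costs.append(np.sum((y - X @ theta) ** 2))
    return theta, np.array(costs)

# Toy data (assumed for illustration).
rng = np.random.default_rng(2)
N, l, k = 50, 100, 5
X = rng.standard_normal((N, l)) / np.sqrt(N)
theta_true = np.zeros(l)
theta_true[rng.choice(l, size=k, replace=False)] = 1.0
y = X @ theta_true

lam = 0.05
theta_fista = fista(X, y, lam)

# Rescale so that the spectral norm is below one before using mu = 1.
s = 1.01 * np.linalg.norm(X, 2)
theta_iht, iht_costs = iht(X / s, y / s, k)
print(lasso_cost(X, y, theta_fista, lam), iht_costs[-1])
```

The only change relative to plain IST is the extrapolation step that forms `z`; the cost per iteration is still dominated by the two matrix-vector products.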

A modified version of the generic scheme given in (10.2), which evolves along the lines of [84],
obtains the updates component-wise, one vector component at a time. Thus, a “full” iteration consists of $l$
steps. The algorithm is known as coordinate descent and its basic iteration has the form (Problem 10.5)

$$\theta_j^{(i)} = S_{\lambda/\|x_j^c\|_2^2}\left(\theta_j^{(i-1)} + \frac{x_j^{cT} e^{(i-1)}}{\|x_j^c\|_2^2}\right), \quad j = 1, 2, \ldots, l, \tag{10.9}$$

where $x_j^c$ denotes the $j$th column of $X$. This algorithm replaces the constant $c$ in the previously reported
soft thresholding algorithm with the squared norm of the respective column of $X$, if the columns of $X$ are
not normalized to unit norm. It has been shown that the parallel coordinate descent algorithm also
converges to a LASSO minimizer of (9.6) [50]. Improvements of the algorithm, using line search techniques
to determine the steepest descent direction for each iteration, have also been proposed (see [124]).
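A component-wise update of this kind can be sketched as follows. The sketch makes one explicit assumption that should be kept in mind: the error vector is refreshed immediately after each component is updated (the sequential, cyclic variant), rather than once per full sweep as in a parallel variant. Function names and data are illustrative.

```python
import numpy as np

def soft_threshold(v, alpha):
    """Soft thresholding; works for scalars and arrays alike."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def coordinate_descent(X, y, lam, n_sweeps=50):
    """Cyclic coordinate descent for 0.5*||y - X theta||^2 + lam*||theta||_1.

    Assumption of this sketch: the residual e is updated after every single
    component, so each step exactly minimizes the cost over that component.
    """
    l = X.shape[1]
    col_sq = np.sum(X ** 2, axis=0)          # squared norms of the columns of X
    theta = np.zeros(l)
    e = y.copy()                              # e = y - X theta, with theta = 0
    costs = [0.5 * e @ e + lam * np.sum(np.abs(theta))]
    for _ in range(n_sweeps):                 # one sweep = l component updates
        for j in range(l):
            x_j = X[:, j]
            new = soft_threshold(theta[j] + x_j @ e / col_sq[j], lam / col_sq[j])
            e += x_j * (theta[j] - new)       # keep e consistent with theta
            theta[j] = new
        costs.append(0.5 * e @ e + lam * np.sum(np.abs(theta)))
    return theta, np.array(costs)

# Toy data (assumed for illustration); columns are deliberately not unit norm.
rng = np.random.default_rng(3)
N, l, k = 40, 80, 4
X = rng.standard_normal((N, l)) * rng.uniform(0.5, 2.0, size=l) / np.sqrt(N)
theta_true = np.zeros(l)
theta_true[rng.choice(l, size=k, replace=False)] = 1.0
y = X @ theta_true

theta_cd, cd_costs = coordinate_descent(X, y, lam=0.05)
print(cd_costs[0], cd_costs[-1])
```

Because the columns here have different norms, both the step and the threshold differ from component to component, which is exactly the role played by the column norm in place of the constant $c$.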

The main contribution to the complexity for the iterative shrinkage algorithmic family comes from
the two matrix-vector products, which amount to $O(Nl)$ operations, unless $X$ has a special structure
(e.g., DFT) that can be exploited to reduce the load.

In [85], the two-stage thresholding (TST) scheme is presented, which brings together arguments
from the iterative shrinkage family and the OMP. This algorithmic scheme involves two stages of
thresholding. The first step is exactly the same as in (10.2). However, this is now used only for
determining “significant” nonzero locations, just as in CSMP algorithms, presented in the previous subsection.
Then, an LS problem is solved to provide the updated estimate, under the constraint of the available
support. This is followed by a second step of thresholding. The thresholding operations in the two
stages can be different. If hard thresholding, $H_k$, is used in both steps, this results in the algorithm
proposed in [58]. For this latter scheme, convergence and performance bounds are derived if the RIP holds
for $\delta_{3k} < 0.58$. In other words, the basic difference between the TST and CSMP approaches is that,
in the latter case, the most significant nonzero coefficients are obtained by looking at the correlation
term $X^T e^{(i-1)}$ and in the TST family by looking at $\theta^{(i-1)} + \mu X^T e^{(i-1)}$. The differences between
different approaches can be minor and the crossing lines between the different algorithmic categories are
not necessarily crisp and clear. However, from a practical point of view, sometimes small differences
may lead to substantially improved performance.

In [41], the IST algorithmic framework was treated as a message passing algorithm in the context
of graphical models (Chapter 15), and the following modified recursion was obtained:

$$\theta^{(i)} = T_i\big(\theta^{(i-1)} + X^T z^{(i-1)}\big), \tag{10.10}$$

$$z^{(i-1)} = y - X\theta^{(i-1)} + \frac{1}{\alpha}\, z^{(i-2)}\, \overline{T'\big(\theta^{(i-2)} + X^T z^{(i-2)}\big)}, \tag{10.11}$$

where $\alpha = \frac{N}{l}$, the overbar denotes the average over all the components of the corresponding vector,
and $T'$ denotes the respective derivative of the component-wise thresholding rule. The extra term on
the right-hand side in (10.11) which now appears turns out to provide a performance improvement
of the algorithm, compared to the IST family, with respect to the undersampling-sparsity tradeoff
(Section 10.2.3). Note that $T_i$ is iteration dependent and it is controlled via the definition of certain
parameters. A parameterless version of it has been proposed in [91]. A detailed treatment on the message
passing algorithms can be found in [2].
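The recursion (10.10)-(10.11) differs from plain IST only by the extra correction term in the residual. The sketch below uses soft thresholding for $T_i$, whose derivative is one where the argument exceeds the threshold in magnitude and zero elsewhere. The threshold rule, $\tau_i = \kappa\,\|z^{(i-1)}\|_2/\sqrt{N}$ with a user-chosen constant `kappa`, is an assumption of this example and not necessarily the tuning used in [41]; the function name `amp` and the toy data are likewise illustrative.

```python
import numpy as np

def soft_threshold(v, alpha):
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def amp(X, y, n_iter=50, kappa=1.5):
    """Message-passing style recursion with a soft thresholding rule.

    Assumption: the iteration-dependent threshold is kappa times an estimate
    of the effective noise level, ||z|| / sqrt(N).
    """
    N, l = X.shape
    alpha = N / l
    theta = np.zeros(l)
    z = y.copy()                               # residual for theta = 0
    for _ in range(n_iter):
        u = theta + X.T @ z                    # argument of the thresholding rule
        tau = kappa * np.linalg.norm(z) / np.sqrt(N)
        theta = soft_threshold(u, tau)
        # Average derivative of the rule: fraction of entries above the threshold.
        avg_derivative = np.mean(np.abs(u) > tau)
        # Residual with the extra term of (10.11).
        z = y - X @ theta + (z * avg_derivative) / alpha
    return theta

# Toy data (assumed): i.i.d. Gaussian sensing matrix, +/-1 nonzero values.
rng = np.random.default_rng(4)
N, l, k = 250, 500, 25
X = rng.standard_normal((N, l)) / np.sqrt(N)
theta_true = np.zeros(l)
theta_true[rng.choice(l, size=k, replace=False)] = rng.choice([-1.0, 1.0], size=k)
y = X @ theta_true

theta_amp = amp(X, y)
rel_err = np.linalg.norm(theta_amp - theta_true) / np.linalg.norm(theta_true)
print(rel_err)
```

Dropping the `avg_derivative` term turns the loop back into an IST iteration with $\mu = 1$, which makes it easy to compare the two on the same data.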

Remarks 10.2. 

• The iteration in (10.6) bridges the IST algorithmic family with another powerful tool in convex
optimization, which builds upon the notion of proximal mapping or Moreau envelopes (see
Chapter 8 and, e.g., [32,105]). Given a convex function $h : \mathbb{R}^l \to \mathbb{R}$ and a $\mu > 0$, the proximal mapping,
$\operatorname{Prox}_{\mu h} : \mathbb{R}^l \to \mathbb{R}^l$, with respect to $h$ and of index $\mu$, is defined as the (unique) minimizer

$$\operatorname{Prox}_{\mu h}(x) := \arg\min_{v\in\mathbb{R}^l}\left\{ h(v) + \frac{1}{2\mu}\|x - v\|_2^2 \right\}, \quad \forall x \in \mathbb{R}^l. \tag{10.12}$$

Let us now assume that we want to minimize a convex function, which is given as the sum

$$f(\theta) = J(\theta) + h(\theta),$$

where $J$ is convex and differentiable, and $h$ is also convex, but not necessarily smooth. Then it can
be shown (Section 8.14) that the following iterations converge to a minimizer of $f$,

$$\theta^{(i)} = \operatorname{Prox}_{\mu h}\left(\theta^{(i-1)} - \mu\frac{\partial J(\theta^{(i-1)})}{\partial\theta}\right), \tag{10.13}$$

where $\mu > 0$, and it can also be made iteration dependent, that is, $\mu_i > 0$. If we now use this scheme
to minimize our familiar cost,

$$J(\theta) + \lambda\|\theta\|_1,$$

we obtain (10.6); this is so because the proximal operator of $h(\theta) := \lambda\|\theta\|_1$ is shown (see [31,32]
and Section 8.13) to be identical to the soft thresholding operator, that is,

$$\operatorname{Prox}_h(\theta) = S_\lambda(\theta).$$

In order to feel more comfortable with this operator, note that if $h(x) = 0$, its proximal operator is
equal to $x$, and in this case (10.13) becomes our familiar gradient descent algorithm.

• All the nongreedy algorithms that have been discussed so far have been developed to solve the task
defined in the formulation (9.6). This is mainly because this is an easier task to solve; once $\lambda$ has
been fixed, it is an unconstrained optimization task. However, there are algorithms that have been
developed to solve the alternative formulations.

The NESTA algorithm has been proposed in [12] and solves the task in its (9.8) formulation.
Adopting this path can have an advantage because $\epsilon$ may be given as an estimate of the uncertainty
associated with the noise, which can readily be obtained in a number of practical applications. In
contrast, selecting a priori the value for $\lambda$ is more intricate. In [28], the value $\lambda = \sigma_\eta\sqrt{2\ln l}$, where
$\sigma_\eta$ is the noise standard deviation, is argued to have certain optimality properties; however, this
argument hinges on the assumption of the orthogonality of $X$. NESTA relies heavily on Nesterov’s
generic scheme [96], hence its name. The original Nesterov algorithm performs a constrained
minimization of a smooth convex function $f(\theta)$, i.e.,

$$\min_{\theta\in Q} f(\theta),$$

where $Q$ is a convex set, and in our case this is associated with the quadratic constraint in (9.8). The
algorithm consists of three basic steps. The first one involves an auxiliary variable, and is similar to
the step in (10.3), i.e.,

$$w^{(i)} = \arg\min_{\theta\in Q}\left\{ \big(\theta - \theta^{(i-1)}\big)^T\frac{\partial f(\theta^{(i-1)})}{\partial\theta} + \frac{L}{2}\big\|\theta - \theta^{(i-1)}\big\|_2^2 \right\}, \tag{10.14}$$

where $L$ is an upper bound on the Lipschitz coefficient, which the gradient of $f$ has to satisfy. The
difference with (10.3) is that the minimization is now a constrained one. However, Nesterov has
also added a second step, involving another auxiliary variable, $z^{(i)}$, which is computed in a similar
way as $w^{(i)}$, but the linearized term is now replaced by a weighted cumulative gradient,

$$\sum_{k=0}^{i-1} a_k\big(\theta - \theta^{(k)}\big)^T\frac{\partial f(\theta^{(k)})}{\partial\theta}.$$

This term smooths out the “zig-zagging” of the path toward the solution, which increases
the convergence speed significantly. The final step of the scheme involves an
averaging of the previously obtained variables,

$$\theta^{(i)} = t_i z^{(i)} + (1 - t_i)w^{(i)}.$$

The values of the parameters $a_k$, $k = 0, \ldots, i-1$, and $t_i$ result from the theory so that convergence
is guaranteed. As was the case with its close relative FISTA, the algorithm enjoys an $O(1/i^2)$
convergence rate. In our case, where the function to be minimized, $\|\theta\|_1$, is not smooth, NESTA
uses a smoothed prox-function of it. Moreover, it turns out that closed form updates are obtained for
$z^{(i)}$ and $w^{(i)}$. If $X$ is chosen in order to have orthonormal rows, the complexity per iteration is $O(l)$
plus the computations needed for performing the product $X^TX$, which is the most computationally
thirsty part. However, this complexity can substantially be reduced if the sensing matrix is chosen to
be a submatrix of a unitary transform which admits fast matrix-vector product computations (e.g.,
a subsampled DFT matrix). For example, for the case of a subsampled DFT matrix, the complexity
amounts to $O(l)$ plus the load to perform the two fast Fourier transforms (FFTs). Moreover, the
continuation strategy can also be employed to accelerate convergence. In [12], it is demonstrated
that NESTA exhibits good accuracy results, while retaining a complexity that is competitive with
algorithms developed around the (9.6) formulation and scales in an affordable way for large-size
problems. Furthermore, NESTA and, in general, Nesterov’s scheme enjoy a generality that allows
their use for other optimization tasks as well.

• The task in (9.7) has been considered in [14] and [99]. In the former, the algorithm comprises a
projection on the $\ell_1$ ball $\|\theta\|_1 \le \rho$ (see also Section 10.4.4) per iteration step. The most
computationally dominant part of the algorithm consists of matrix-vector products. In [99], a homotopy
algorithm is derived for the same task, where now the bound $\rho$ becomes the homotopy parameter
that is left to vary. This algorithm is also referred to as the LARS-LASSO, as has already been
reported.
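Returning to the first of the remarks above, the proximal iteration (10.13) can be written once, with the proximal operator passed in as an argument. The following is a minimal sketch under stated assumptions: the names `proximal_gradient`, `prox_l1`, and `prox_zero`, the calling convention `prox(x, mu)`, the fixed step, and the toy data are choices made for this example.

```python
import numpy as np

def soft_threshold(v, alpha):
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def prox_l1(x, mu, lam):
    """Proximal operator of index mu for h = lam * ||.||_1: soft thresholding."""
    return soft_threshold(x, lam * mu)

def prox_zero(x, mu):
    """Proximal operator of h = 0: the identity."""
    return x

def proximal_gradient(grad_J, prox, theta0, mu, n_iter=200):
    """Iteration (10.13): a gradient step on J followed by the prox of h."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta = prox(theta - mu * grad_J(theta), mu)
    return theta

# Toy LS cost J(theta) = 0.5 * ||y - X theta||^2 (data assumed for illustration).
rng = np.random.default_rng(5)
N, l = 30, 60
X = rng.standard_normal((N, l)) / np.sqrt(N)
y = rng.standard_normal(N)
grad_J = lambda th: X.T @ (X @ th - y)
mu = 0.99 / np.linalg.norm(X, 2) ** 2
lam = 0.1

# h = lam * ||.||_1 gives the soft thresholding iteration (10.6).
theta_l1 = proximal_gradient(grad_J, lambda x, m: prox_l1(x, m, lam), np.zeros(l), mu)

# h = 0 reduces the scheme to plain gradient descent.
one_step = proximal_gradient(grad_J, prox_zero, np.zeros(l), mu, n_iter=1)
print(np.count_nonzero(theta_l1), np.linalg.norm(one_step))
```

Swapping `prox` for a projection onto a convex set (the proximal operator of that set's indicator function) yields a projected gradient scheme with the same loop.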

10.2.3 WHICH ALGORITHM? SOME PRACTICAL HINTS 

We have already discussed a number of algorithmic alternatives to obtain solutions to the $\ell_0$ or $\ell_1$ norm
minimization tasks. Our focus was on schemes whose computational demands are rather low and that
scale well to very large problem sizes. We have not touched more expensive methods such as interior
point methods for solving the $\ell_1$ convex optimization task. A review of such methods is provided in
[72]. Interior point methods evolve along the Newton-type recursion and their complexity per iteration
step is at least of the order $O(l^3)$. As is most often the case, there is a tradeoff. Schemes of higher
complexity tend to result in enhanced performance. However, such schemes become impractical in
problems of large size. Some examples of other algorithms that were not discussed can be found in
[14,35,118,121]. Talking about complexity, it has to be pointed out that what really matters at the
end is not so much the complexity per iteration step, but the overall required resources in computer
time/memory for the algorithm to converge to a solution within a specified accuracy. For example,
an algorithm may be of low complexity per iteration step, but it may need an excessive number of
iterations to converge.

Computational load is only one among a number of indices that characterize the performance of an
algorithm. Throughout the book so far, we have considered a number of other performance measures,
such as convergence rate, tracking speed (for the adaptive algorithms), and stability with respect to
the presence of noise and/or finite word length computations. No doubt, all these performance
measures are of interest here as well. However, there is an additional aspect that is of particular importance
when quantifying performance of sparsity promoting algorithms. This is related to the
undersampling-sparsity tradeoff or the phase transition curve.

One of the major issues on which we focused in Chapter 9 was to derive and present the conditions
that guarantee uniqueness of the $\ell_0$ minimization and its equivalence to the $\ell_1$ minimization task, under
an underdetermined set of observations, $y = X\theta$, for the recovery of sparse enough signals/vectors.
While discussing the various algorithms in this section, we reported a number of different RIP-related
conditions that some of the algorithms have to satisfy in order to recover the target sparse vector.
As a matter of fact, it has to be admitted that this was quite confusing, because each algorithm had to
satisfy its own conditions. In addition, in practice, these conditions are not easy to verify. Although
such results are undoubtedly important to establish convergence, make us more confident, and help us
better understand why and how an algorithm works, one needs further experimental evidence in order
to establish good performance bounds for an algorithm. Moreover, all the conditions we have dealt
with, including coherence and RIP, are sufficient conditions. In practice, it turns out that sparse signal
recovery is possible with sparsity levels much higher than those predicted by the theory, for given $N$
and $l$. Hence, proposing a new algorithm or selecting an algorithm from an available palette, one has
to demonstrate experimentally the range of sparsity levels that can be recovered by the algorithm, as a
percentage of the number of observations and the dimensionality. Thus, in order to select an algorithm,
one should cast her/his vote for the algorithm that, for given $l$ and $N$, has the potential to recover
$k$-sparse vectors with $k$ being as high as possible for most of the cases, that is, with high probability.

Fig. 10.3 illustrates the type of curve that is expected to result in practice. The vertical axis is the
probability of exact recovery of a target $k$-sparse vector and the horizontal axis shows the ratio $k/N$,
for a given number of observations, $N$, and the dimensionality of the ambient space, $l$. Three curves are
shown. The red ones correspond to the same algorithm, for two different values of the dimensionality $l$,
and the gray one corresponds to another algorithm. Curves of this shape are expected to result from
experiments of the following setup. Assume that we are given a sparse vector $\theta_o$, with $k$ nonzero
components in the $l$-dimensional space. Using a sensing matrix $X$, we generate $N$ samples/observations
$y = X\theta_o$. The experiment is repeated $M$ times, each time using a different realization of
the sensing matrix and a different $k$-sparse vector. For each instance, the algorithm is run to recover the
target sparse vector. This is not always possible. We count the number, $m$, of successful recoveries, and
compute the corresponding percentage of successful recovery (probability), $m/M$, which is plotted on
the vertical axis of Fig. 10.3. The procedure is repeated for a different value of $k$, $1 \le k \le N$. A number
of issues now jump onto the stage: (a) how does one select the ensemble of sensing matrices, and (b)
how does one select the ensemble of sparse vectors? There are different scenarios, and some typical
examples are described next.

1. The $N \times l$ sensing matrices $X$ are formed by:

(a) different i.i.d. realizations with elements drawn from a Gaussian $\mathcal{N}(0, 1/N)$,

(b) different i.i.d. realizations from the uniform distribution on the unit sphere in $\mathbb{R}^N$, which is
also known as the uniform spherical ensemble,

(c) different i.i.d. realizations with elements drawn from Bernoulli-type distributions,

(d) different i.i.d. realizations of partial Fourier matrices, each time using a different set of $N$ rows.

2. The $k$-sparse target vector $\theta_o$ is formed by selecting the locations of (at most) $k$ nonzero elements
randomly, by “tossing a coin” with probability $p = k/l$, and filling the values of the nonzero
elements according to a statistical distribution (e.g., Gaussian, uniform, double exponential, Cauchy).
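The experimental setup described above can be sketched as a small Monte Carlo loop. The sketch uses scenario 1(a) for the sensing matrices and scenario 2 with Gaussian values for the target vectors. Everything else is an assumption of this example: the function names, the success criterion (a relative error below $10^{-6}$), the problem sizes, and the use of a basic OMP routine as the algorithm under test, which merely stands in for whatever algorithm one wishes to evaluate. Note that with coin tossing the number of nonzero elements is random, with mean $k$.

```python
import numpy as np

def omp(X, y, tol=1e-9):
    """Basic OMP used as a stand-in solver: one new column per step, LS refit."""
    N, l = X.shape
    support, theta, e = [], np.zeros(l), y.copy()
    while len(support) < N and np.linalg.norm(e) > tol:
        j = int(np.argmax(np.abs(X.T @ e)))
        if j in support:                 # no new column can be added; stop
            break
        support.append(j)
        z, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        theta = np.zeros(l)
        theta[support] = z
        e = y - X @ theta
    return theta

def recovery_probability(recover, N, l, k, M, rng):
    """Fraction m/M of successful recoveries over M random instances."""
    m = 0
    for _ in range(M):
        X = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, l))   # entries ~ N(0, 1/N)
        mask = rng.random(l) < k / l                          # coin toss, p = k/l
        theta_o = np.where(mask, rng.standard_normal(l), 0.0)
        y = X @ theta_o                                       # exact measurements
        theta_hat = recover(X, y)
        if np.linalg.norm(theta_hat - theta_o) <= 1e-6 * max(1.0, np.linalg.norm(theta_o)):
            m += 1
    return m / M

rng = np.random.default_rng(6)
N, l, M = 40, 80, 40
ks = [2, 10, 20, 40]
probs = [recovery_probability(omp, N, l, k, M, rng) for k in ks]
for k, p in zip(ks, probs):
    print(f"k/N = {k / N:.2f}  estimated probability of recovery = {p:.2f}")
```

Replacing `omp` by any other function with the signature `recover(X, y)` produces the corresponding curve for that algorithm under the same scenario.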

Other scenarios are also possible. Some authors set all nonzero values to one [16], or to ±1, with
the randomness imposed on the choice of the sign. It must be stressed that the performance of an
algorithm may vary significantly under different experimental scenarios, and this may be indicative of
the stability of an algorithm. In practice, a user may be interested in a specific scenario that is more
representative of the available data.

FIGURE 10.3

For any algorithm, the transition between the regions of 100% success and complete failure is very sharp. For the
algorithm corresponding to the red curve, this transition occurs at higher sparsity values and, from this point of
view, it is a better algorithm than the one associated with the gray curve. Also, given an algorithm, the higher the
dimensionality, the higher the sparsity level where this transition occurs, as indicated by the two red curves.

Looking at Fig. 10.3, the following conclusions are in order. In all curves, there is a sharp transition
between two levels, from the 100% success to the 0% success. Moreover, the higher the dimensionality,
the sharper the transition. This has also been shown theoretically in [40]. For the algorithm
corresponding to the red curves, this transition occurs at higher values of $k$, compared to the algorithm that
generates the curve drawn in gray. Provided that the computational complexity of the “red” algorithm
can be accommodated by the resources that are available for a specific application, this seems to be the
more sensible choice between the two algorithms. However, if the resources are limited, concessions
are unavoidable.

Another way to “interrogate” and demonstrate the performance of an algorithm, with respect to its
robustness to the range of values of sparsity levels that can be successfully recovered, is via the phase
transition curve. To this end, define:

• $\alpha := \frac{N}{l}$, which is a normalized measure of the problem indeterminacy,

• $\beta := \frac{k}{N}$, which is a normalized measure of sparsity.

In the sequel, plot a graph having $\alpha \in [0, 1]$ in the horizontal axis and $\beta \in [0, 1]$ in the vertical one.
For each point $(\alpha, \beta)$ in the $[0, 1] \times [0, 1]$ region, compute the probability of the algorithm to recover
a $k$-sparse target vector. In order to compute the probability, one has to adopt one of the previously
stated scenarios. In practice, one has to form a grid of points that cover densely enough the region
$[0, 1] \times [0, 1]$ in the graph. Use a varying intensity level scale to color the corresponding $(\alpha, \beta)$ point.
Black corresponds to probability one and red to probability zero. Fig. 10.4 illustrates the type of graph
that is expected to be recovered in practice for large values of $l$; that is, the transition from the region
(phase) of “success” (black) to that of “fail” (red) is very sharp. As a matter of fact, there is a curve
that separates the two regions. The theoretical aspects of this curve have been studied in the context
of combinatorial geometry in [40] for the asymptotic case, $l \to \infty$, and in [42] for finite values of $l$.
Observe that the larger the value of $\alpha$ (larger percentage of observations), the larger the value of $\beta$
at which the transition occurs. This is in line with what we have said so far in this chapter, and the
problem gets increasingly difficult as one moves up and to the left in the graph. In practice, for smaller
values of $l$, the transition region from red to black is smoother, and it gets narrower as $l$ increases. In
such cases, one can draw an approximate curve that separates the “success” and “fail” regions using
regression techniques (see, e.g., [85]).

The reader may already be aware of the fact that, so far, we have avoided talking about the
performance of individual algorithms. We have just discussed some “typical” behavior that algorithms
tend to exhibit in practice. What the reader might have expected is a discussion of comparative
performance tests and related conclusions. The reason is that at the time the current edition is being compiled
there are no definite answers. Most authors compare their newly suggested algorithm with a few other
algorithms, usually within a certain algorithmic family, and, more importantly, under some specific
scenarios, where the advantages of the newly suggested algorithm are documented. However, the
performance of an algorithm can change significantly by changing the experimental scenario under which
the tests are carried out. The most comprehensive comparative performance study so far has been
carried out in [85]. However, even in this work, the scenario of exact measurements has been considered
and there are no experiments concerning the robustness of individual algorithms to the presence of
noise. It is important to say that this study involved a huge effort of computation. We will comment on
some of the findings from this study, which will also reveal to the reader that different experimental
scenarios can significantly affect the performance of an algorithm.

FIGURE 10.4

Typical phase transition behavior of a sparsity promoting algorithm. Black corresponds to 100% success of
recovering the sparsest solution, and red to 0%. For high-dimensional spaces, the transition is very sharp, as is the case
in the figure. For lower dimensionality values, the transition from black to red is smoother and involves a region of
varying color intensity.

Fig. 10.5A shows the obtained phase transition curves for (a) the iterative hard thresholding (IHT);
(b) the iterative soft thresholding (IST) scheme of (10.2); (c) the two-stage thresholding (TST) scheme,
as discussed earlier; (d) the LARS algorithm; and (e) the OMP algorithm, together with the
theoretically obtained one using $\ell_1$ minimization. All algorithms were tuned with the optimal values, with
respect to the required user-defined parameters, after extensive experimentation. The results in the
figure correspond to the uniform spherical scenario for the generation of the sensing matrices. Sparse
vectors were generated according to the ±1 scenario for the nonzero coefficients. The interesting
observation is that, although the curves deviate from each other as they move to larger values of $\beta$, for
smaller values, the differences in their performance become less and less. This is also true for
computationally simple schemes such as the IHT one. The performance of LARS is close to the optimal
one. However, this comes at the cost of increased computation. The required computational time for
achieving the same accuracy, as reported in [85], favors the TST algorithm. In some cases, LARS
required excessively longer time to reach the same accuracy, in particular when the sensing matrix was
the partial Fourier one and fast schemes to perform matrix-vector products can be exploited. For such
matrices, the thresholding schemes (IHT, IST, TST) exhibited a performance that scales very well to
large-size problems.

Fig. 10.5B indicates the phase transition curve for one of the algorithms (IST) as we change the
scenarios for generating the sparse (target) vectors, using different distributions: (a) ±1, with equiprobable
selection of signs (constant amplitude random selection [CARS]); (b) double exponential (power);
(c) Cauchy; and (d) uniform in $[-1, 1]$. This is indicative and typical for other algorithms as well, with
some of them being more sensitive than others. Finally, Fig. 10.5C shows the transition curves for
the IST algorithm by changing the sensing matrix generation scenario. Three curves are shown
corresponding to (a) uniform spherical ensemble (USE); (b) random sign ensemble (RSE), where the
elements are ±1 with signs uniformly distributed; and (c) the uniform random projection (URP)
ensemble. Once more, one can observe the possible variations that are expected due to the use of different
matrix ensembles. Moreover, changing ensembles affects each algorithm in a different way.

FIGURE 10.5

(A) The obtained phase transition curves for different algorithms under the same experimental scenario, together
with the theoretical one. (B) Phase transition curve for the IST algorithm under different experimental scenarios
for generating the target sparse vector. (C) The phase transition for the IST algorithm under different experimental
scenarios for generating the sensing matrix $X$.

Concluding this section, it must be emphasized that algorithmic development is still an ongoing
research field, and it is too early to come up with definite and concrete comparative performance
conclusions. Moreover, besides the algorithmic front, existing theories often fall short in predicting what is
observed in practice, with respect to their phase transition performance. For a related discussion, see,
for example, [43].


10.3 VARIATIONS ON THE SPARSITY-AWARE THEME 

In our tour so far, we have touched a number of aspects of sparsity-aware learning that come from
mainstream theoretical developments. However, a number of variants have appeared and have been
developed with the goal of addressing problems of a more special structure and/or proposing alternatives,
which can be beneficial in boosting the performance in practice by serving the needs of specific
applications. These variants focus on the regularization term in (9.6), and/or on the misfit measuring term.
Once more, research activity in this direction has been dense, and our purpose is to simply highlight
possible alternatives and make the reader aware of the various possibilities that spring forth from the
basic theory.

In a number of tasks, it is a priori known that the nonzero parameters in the target signal/vector
occur in groups and they are not randomly spread in all possible positions. A typical example is the
echo path in internet telephony, where the nonzero coefficients of the impulse response tend to cluster
together (see Fig. 9.5). Other examples of “structured” sparsity can be traced in DNA microarrays,
MIMO channel equalization, source localization in sensor networks, magnetoencephalography, and
neuroscience problems (e.g., [1,9,10,60,101]). As is always the case in machine learning, being able to
incorporate a priori information into the optimization can only be of benefit for improving performance,
because the estimation task is externally assisted in its effort to search for the target solution.

The group LASSO [8,59,97,98,117,122] addresses the task where it is a priori known that the
nonzero components occur in groups. The unknown vector $\theta$ is divided into $L$ groups, that is,

$$\theta = [\theta_1^T, \ldots, \theta_L^T]^T,$$

each of them of a predetermined size, $s_i$, $i = 1, 2, \ldots, L$, with $\sum_{i=1}^{L} s_i = l$. The regression model can
then be written as

$$y = X\theta + \eta = \sum_{i=1}^{L} X_i \theta_i + \eta,$$

where each $X_i$ is a submatrix of $X$ comprising the corresponding $s_i$ columns. The solution of the group
LASSO is given by the following regularized task:

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^l} \left( \Big\| y - \sum_{i=1}^{L} X_i \theta_i \Big\|_2^2 + \lambda \sum_{i=1}^{L} \sqrt{s_i}\, \|\theta_i\|_2 \right), \tag{10.15}$$

where $\|\theta_i\|_2$ is the Euclidean norm (not the squared one) of $\theta_i$, that is,

$$\|\theta_i\|_2 = \sqrt{\sum_{j=1}^{s_i} |\theta_{i,j}|^2}.$$

In other words, the individual components of $\theta$, which contribute to the formation of the $\ell_1$ norm in
the standard LASSO formulation, are now replaced by the square root of the energy of each individual
block. In this setting, it is not the individual components but blocks of them that are forced to zero,
when their contribution to the LS misfit measuring term is not significant. Sometimes, this type of
regularization is referred to as the $\ell_1/\ell_2$ regularization. An example of an $\ell_1/\ell_2$ ball for $\theta \in \mathbb{R}^3$ can be
seen in Fig. 10.6B in comparison with the corresponding $\ell_1$ ball depicted in Fig. 10.6A.
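As a hedged illustration of how the penalty in (10.15) forces whole blocks to zero, the following minimal Python sketch implements block soft thresholding, that is, the proximal operator of $\lambda \sum_i \sqrt{s_i}\,\|\theta_i\|_2$, which is the basic building block if one chooses to attack (10.15) with a proximal gradient method. The function name, the list-based interface, and the toy numbers are assumptions made here for illustration only; they are not taken from the text.

```python
import math


def group_soft_threshold(theta, group_sizes, lam):
    """Block soft thresholding (hypothetical helper).

    Returns argmin_u 0.5*||u - theta||^2 + lam * sum_i sqrt(s_i)*||u_i||_2,
    where the blocks u_i are consecutive and have sizes group_sizes.
    """
    out = []
    start = 0
    for s in group_sizes:
        block = theta[start:start + s]
        norm = math.sqrt(sum(v * v for v in block))
        thresh = lam * math.sqrt(s)
        # A block whose Euclidean norm is below the threshold is zeroed as a whole;
        # otherwise its length is reduced by the threshold, keeping its direction.
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0.0 else 0.0
        out.extend(scale * v for v in block)
        start += s
    return out


theta = [3.0, 4.0, 0.1, -0.1]          # two blocks of size 2
shrunk = group_soft_threshold(theta, group_sizes=[2, 2], lam=1.0)
```

In this toy run the second block has Euclidean norm below $\lambda\sqrt{s_i}$ and is set to zero as a whole, while the first block keeps its direction and only its length is reduced.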



FIGURE 10.6

Representation of balls of $\theta \in \mathbb{R}^3$ corresponding to: (A) the $\ell_1$ norm; (B) an $\ell_1/\ell_2$ with nonoverlapping groups (one
group comprises $\{\theta_1, \theta_2\}$ and the other one $\{\theta_3\}$); and (C) the $\ell_1/\ell_2$ with overlapping groups comprising $\{\theta_1, \theta_2, \theta_3\}$,
$\{\theta_1\}$, and $\{\theta_3\}$ [6].

Beyond the conventional group LASSO, often referred to as block sparsity, research effort has been
dedicated to the development of learning strategies incorporating more elaborate structured sparse
models. There are two major reasons for such directions. First, in a number of applications the unknown
set of parameters, $\theta$, exhibits a structure that cannot be captured by the block sparse model. Second, even
for cases where $\theta$ is block sparse, standard grouped $\ell_1$ norms require information about the partitioning
of $\theta$. This can be rather restrictive in practice. The adoption of overlapping groups has been proposed
as a possible solution. Assuming that every parameter belongs to at least one group, such models lead
to optimization tasks that, in many cases, are not hard to solve, for example, by resorting to proximal
methods [6,7]. Moreover, by using properly defined overlapping groups [71], the allowed sparsity
patterns can be constrained to form hierarchical structures, such as connected and rooted trees and
subtrees that are met, for example, in multiscale (wavelet) decompositions. In Fig. 10.6C, an example
of an $\ell_1/\ell_2$ ball for overlapping groups is shown.

Besides the previously stated directions, extensions of the compressed sensing principles to cope
with structured sparsity led to the model-based compressed sensing [10,26]. The $(k, C)$ model allows
the significant parameters of a $k$-sparse signal to appear in at most $C$ clusters, whose size is unknown.
In Section 9.9, it was commented that searching for a $k$-sparse solution takes place in a union of
subspaces, each one of dimensionality $k$. Imposing a certain structure on the target solution restricts the
search to a subset of these subspaces and leaves a number of them out of the game. This obviously
facilitates the optimization task. In [27], structured sparsity is considered in terms of graphical models,
and in [110] the C-HiLasso group sparsity model was introduced, which allows each block to have a
sparse structure itself. Theoretical results that extend the RIP to the block RIP have been developed
and reported (see, for example, [18,83]) and, on the algorithmic front, proper modifications of greedy
algorithms have been proposed in order to provide structured sparse solutions [53].

In [24], it is suggested to replace the $\ell_1$ norm by a weighted version of it. To justify such a choice,
let us recall Example 9.2 and the case where the “unknown” system was sensed using $x = [2, 1]^T$. We
have seen that by “blowing up” the $\ell_1$ ball, the wrong sparse solution was obtained. Let us now replace
the $\ell_1$ norm in (9.21) with its weighted version,

$$\|\theta\|_{1,w} := w_1 |\theta_1| + w_2 |\theta_2|, \quad w_1, w_2 > 0,$$

and set $w_1 = 4$ and $w_2 = 1$. Fig. 10.7A shows the isovalue curve $\|\theta\|_{1,w} = 1$, together with that resulting
from the standard $\ell_1$ norm. The weighted one is sharply “pinched” around the vertical axis, and the
larger the value of $w_1$, compared to that of $w_2$, the sharper the corresponding ball will be. Fig. 10.7B
shows what happens when “blowing up” the weighted $\ell_1$ ball. It will first touch the point (0, 1), which
is the true solution. Basically, what we have done is “squeeze” the ball to be aligned more to the
axis that contains the (sparse) solution. For the case of our example, any weight $w_1 > 2$ would do the
job.
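The two-dimensional example can be checked numerically with a small sketch. It relies on a simple fact that holds for a single linear constraint $x^T\theta = y$: since $\sum_j w_j|\theta_j| \ge \min_j (w_j/|x_j|)\,|y|$, the weighted $\ell_1$ norm is minimized by placing all the energy on the coordinate with the smallest ratio $w_j/|x_j|$. The function name is hypothetical, and the observation value $y = 1$ is inferred here from the statement that the true solution is the point (0, 1).

```python
def weighted_l1_single_equation(x, y, w):
    """Minimize sum_j w_j*|theta_j| subject to the single equation x^T theta = y.

    Hypothetical helper for the toy example only: with one constraint, a minimizer
    has a single nonzero entry, at the index with the smallest ratio w_j / |x_j|.
    """
    candidates = [j for j in range(len(x)) if x[j] != 0.0]
    best = min(candidates, key=lambda j: w[j] / abs(x[j]))
    theta = [0.0] * len(x)
    theta[best] = y / x[best]
    return theta


x, y = [2.0, 1.0], 1.0
plain = weighted_l1_single_equation(x, y, w=[1.0, 1.0])      # standard l1 norm
weighted = weighted_l1_single_equation(x, y, w=[4.0, 1.0])   # w1 = 4, w2 = 1
```

With unit weights the ratios are 1/2 and 1, so the (wrong) solution (1/2, 0) is selected; with $w_1 = 4$ the ratios become 2 and 1, and the solution (0, 1) is obtained. The two ratios tie exactly at $w_1 = 2$, which agrees with the requirement $w_1 > 2$ stated above.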

Consider now the general case of a weighted norm,

$$\|\theta\|_{1,w} := \sum_{j=1}^{l} w_j |\theta_j|, \quad w_j > 0: \quad \text{weighted } \ell_1 \text{ norm}. \tag{10.16}$$

The ideal choice of the weights would be

$$w_j = \begin{cases} \dfrac{1}{|\theta_{o,j}|}, & \theta_{o,j} \neq 0, \\ \infty, & \theta_{o,j} = 0, \end{cases}$$

where $\theta_o$ is the target true vector, and where we have tacitly assumed that $0 \cdot \infty = 0$. In other words,
the smaller a parameter is, the larger the respective weight becomes. This is justified, because large
weighting will force respective parameters toward zero during the minimization process. Of course, in
practice the values of the true vector are not known, so it is suggested to use their estimates during each
iteration of the minimization procedure. The resulting scheme is of the following form.

FIGURE 10.7

(A) The isovalue curves for the $\ell_1$ and the weighted $\ell_1$ norms for the same value. The weighted $\ell_1$ is sharply
pinched around one of the axes, depending on the weights. (B) By minimizing the weighted $\ell_1$ norm for the
setup of Fig. 9.9C, the correct sparse solution is obtained.


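Algorithm 10.3.

1. Initialize weights to unity, $w_j^{(0)} = 1$, $j = 1, 2, \ldots, l$.

2. Minimize the weighted $\ell_1$ norm,

$$\theta^{(i)} = \arg\min_{\theta \in \mathbb{R}^l} \|\theta\|_{1,w}, \quad \text{s.t.} \quad y = X\theta.$$

3. Update the weights

$$w_j^{(i+1)} = \frac{1}{|\theta_j^{(i)}| + \epsilon}, \quad j = 1, 2, \ldots, l.$$

4. Terminate when a stopping criterion is met, otherwise return to step 2.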
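The steps of the algorithm can be sketched in a few lines of Python. This is only a sketch under stated assumptions: step 2 is a linear program, and for readability it is solved here by brute-force enumeration of basic solutions (at most $N$ nonzero entries), which is exact for tiny problems with a full row rank $X$ but is in no way the solver used in [24]. The function names, the fixed number of iterations used as stopping criterion, and the toy data are all hypothetical.

```python
import itertools

import numpy as np


def weighted_l1_equality(X, y, w):
    """Step 2 for tiny problems: min sum_j w_j*|theta_j| s.t. X theta = y.

    Enumerates all supports of size N with an invertible submatrix; an optimum of
    the underlying linear program is attained at one of these basic solutions.
    """
    N, l = X.shape
    best, best_cost = None, np.inf
    for support in itertools.combinations(range(l), N):
        cols = list(support)
        sub = X[:, cols]
        if abs(np.linalg.det(sub)) < 1e-12:
            continue  # skip (numerically) singular submatrices
        theta = np.zeros(l)
        theta[cols] = np.linalg.solve(sub, y)
        cost = float(np.sum(w * np.abs(theta)))
        if cost < best_cost:
            best, best_cost = theta, cost
    return best


def reweighted_l1(X, y, eps=0.1, n_iter=5):
    """Iteratively reweighted l1 minimization, following steps 1 to 4."""
    w = np.ones(X.shape[1])                      # step 1: unit weights
    for _ in range(n_iter):                      # step 4: fixed iteration budget
        theta = weighted_l1_equality(X, y, w)    # step 2: weighted l1 task
        w = 1.0 / (np.abs(theta) + eps)          # step 3: weight update
    return theta


X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0])                         # generated by theta = [0, 0, 1]
theta_hat = reweighted_l1(X, y)
```

The toy system is only a smoke test of the loop: the sparse generator $[0, 0, 1]^T$ is already found with unit weights, and the subsequent reweighting leaves it unchanged. For problems of realistic size, the enumeration would have to be replaced by a proper linear programming or convex optimization solver.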


The constant $\epsilon$ is a small user-defined parameter to guarantee stability when the estimates of the
coefficients take very small values. Note that if the weights have constant preselected values, the task
retains its convex nature; this is no longer true when the weights are changing. It is interesting to
point out that this intuitively motivated weighting scheme can result if the $\ell_1$ norm is replaced by
$\sum_{j=1}^{l} \ln(|\theta_j| + \epsilon)$ as the regularizing term of (9.6). Fig. 10.8 shows the respective graph in the
one-dimensional space together with that of the $\ell_1$ norm. The graph of the logarithmic function reminds
us of the $\ell_p$, $0 < p < 1$, “norms” and the comments made in Section 9.2. This is no longer a convex
function, and the iterative scheme given before is the result of a majorization-minimization procedure
in order to solve the resulting nonconvex task [24] (Problem 10.6).

FIGURE 10.8

One-dimensional graphs of the $\ell_1$ norm and the logarithmic regularizer $\ln\left(\frac{|\theta|}{\epsilon} + 1\right) = \ln(|\theta| + \epsilon) - \ln \epsilon$, with
$\epsilon = 0.1$. The term $\ln \epsilon$ was subtracted for illustration purposes only and does not affect the optimization. Note the
nonconvex nature of the logarithmic regularizer.
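The link between the logarithmic regularizer and the weight update $w_j = 1/(|\theta_j^{(i)}| + \epsilon)$ can be made explicit with a short worked step. It is given here as a sketch of the standard majorization argument, based only on the concavity of the logarithm, and not as the full solution of the related problem.

```latex
% Concavity of ln(.) gives a tangent-line upper bound at the current estimate:
\ln(|\theta_j| + \epsilon) \;\le\; \ln\big(|\theta_j^{(i)}| + \epsilon\big)
   + \frac{|\theta_j| - |\theta_j^{(i)}|}{|\theta_j^{(i)}| + \epsilon}.
% Summing over j, and collecting the terms that do not depend on \theta in c^{(i)}:
\sum_{j=1}^{l} \ln(|\theta_j| + \epsilon) \;\le\;
   \sum_{j=1}^{l} \frac{|\theta_j|}{|\theta_j^{(i)}| + \epsilon} + c^{(i)}.
% Minimizing the right-hand side under the same constraints is a weighted l1 task
% with weights
w_j^{(i+1)} = \frac{1}{|\theta_j^{(i)}| + \epsilon}.
```

Because the bound is tight at $\theta = \theta^{(i)}$, each such minimization cannot increase the value of the logarithmic cost, which is the defining property of a majorization-minimization scheme.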


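The concept of iterative weighting, as used before, has also been applied in the context of the
iterative reweighted LS algorithm. Observe that the $\ell_1$ norm can be written as

$$\|\theta\|_1 = \sum_{j=1}^{l} |\theta_j| = \theta^T W_\theta \theta,$$

where

$$W_\theta = \begin{bmatrix} \dfrac{1}{|\theta_1|} & 0 & \cdots & 0 \\ 0 & \dfrac{1}{|\theta_2|} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dfrac{1}{|\theta_l|} \end{bmatrix},$$

and where in the case of $\theta_i = 0$, for some $i \in \{1, 2, \ldots, l\}$, the respective coefficient of $W_\theta$ is defined
to be 1. If $W_\theta$ were a constant weighting matrix, that is, $W_\theta := W_{\tilde{\theta}}$, for some fixed $\tilde{\theta}$, then obtaining
the minimum

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^l} \|y - X\theta\|_2^2 + \lambda \theta^T W_{\tilde{\theta}} \theta,$$

would be straightforward and similar to the ridge regression. In the iterative reweighted scheme, $W_\theta$
is replaced by $W_{\theta^{(i)}}$, formed by using the respective estimates of the parameters, which have been
obtained from the previous iteration, that is, $\tilde{\theta} := \theta^{(i)}$, as we did before. In the sequel, each iteration
solves a weighted ridge regression task.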
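A minimal NumPy sketch of this iteration is given next. Several choices are assumptions made here and not part of the text: the LS starting point, the fixed number of iterations, and the small floor on $|\theta_j|$ that is used in place of the convention of setting the weight to 1 at exact zeros, so that components that have shrunk keep being pushed toward zero.

```python
import numpy as np


def irls_l1(X, y, lam, n_iter=100, floor=1e-8):
    """Iteratively reweighted ridge regression (hypothetical helper).

    Each pass freezes W = diag(1/|theta_j|) at the previous estimate and solves
    min ||y - X theta||^2 + lam * theta^T W theta in closed form.
    """
    theta = np.linalg.lstsq(X, y, rcond=None)[0]         # assumed starting point
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(theta), floor)       # diagonal of W
        theta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
    return theta


X = np.eye(2)                     # orthonormal toy design
y = np.array([3.0, 0.01])
theta_hat = irls_l1(X, y, lam=1.0)
```

In this orthonormal toy case each coordinate evolves independently: the large coefficient settles at $|y_1| - \lambda = 2$, while the small one is driven to (numerically) zero. Note that, because the gradient of the frozen quadratic penalty is $2\lambda W\theta$, the fixed point of this particular iteration behaves like an $\ell_1$ penalty of effective weight $2\lambda$; this remark concerns the sketch only.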

The focal underdetermined system solver (FOCUSS) algorithm [64] was the first one to use the
concept of iterative-reweighted-least-squares (IRLS) to represent $\ell_p$, $p < 1$, as a weighted norm in
order to find a sparse solution to an underdetermined system of equations. This algorithm is also of
historical importance, because it is among the very first ones to emphasize the importance of sparsity;
moreover, it provides comprehensive convergence analysis as well as a characterization of the
stationary points of the algorithm. Variants of this basic iterative weighting scheme have also been proposed
(see, e.g., [35] and the references therein).

In [126], the elastic net regularization penalty was introduced, which combines the $\ell_2$ and $\ell_1$
concepts together in a tradeoff fashion, that is,

$$\lambda \sum_{i=1}^{l} \left( \alpha \theta_i^2 + (1 - \alpha) |\theta_i| \right),$$

where $\alpha$ is a user-defined parameter controlling the influence of each individual term. The idea behind
the elastic net is to combine the advantages of the LASSO and the ridge regression. In problems where
there is a group of variables in $x$ that are highly correlated, LASSO tends to select one of the
corresponding parameters in $\theta$, and set the rest to zero in a rather arbitrary fashion. This can be understood
by looking carefully at how the greedy algorithms work. When sparsity is used to select the most
important of the variables in $x$ (feature selection), it is better to select all the relevant components in the
group. If one knew which of the variables are correlated, one could form a group and then use the
group LASSO. However, if this is not known, involving the ridge regression offers a remedy to the
problem. This is because the $\ell_2$ penalty in ridge regression tends to shrink the coefficients associated
with correlated variables toward each other (e.g., [68]). In such cases, it would be better to work with
the elastic net rationale that involves LASSO and ridge regression in a combined fashion.
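To see how the two terms of the penalty act together, the sketch below evaluates the elastic net penalty and the scalar minimizer of $\frac{1}{2}(t - z)^2 + \lambda(\alpha t^2 + (1 - \alpha)|t|)$, which follows from setting the (sub)gradient to zero: a soft thresholding by $\lambda(1 - \alpha)$, due to the $\ell_1$ part, followed by a division by $1 + 2\lambda\alpha$, due to the $\ell_2$ part. The function names and the numbers are illustrative assumptions.

```python
def elastic_net_penalty(theta, lam, alpha):
    """lam * sum_i (alpha * theta_i^2 + (1 - alpha) * |theta_i|)."""
    return lam * sum(alpha * t * t + (1.0 - alpha) * abs(t) for t in theta)


def elastic_net_prox(z, lam, alpha):
    """Scalar minimizer of 0.5*(t - z)^2 + lam*(alpha*t^2 + (1 - alpha)*|t|)."""
    shrunk = max(abs(z) - lam * (1.0 - alpha), 0.0)   # l1 part: soft thresholding
    sign = 1.0 if z >= 0 else -1.0
    return sign * shrunk / (1.0 + 2.0 * lam * alpha)  # l2 part: uniform shrinkage


lasso_like = elastic_net_prox(3.0, lam=1.0, alpha=0.0)   # pure soft thresholding
ridge_like = elastic_net_prox(3.0, lam=1.0, alpha=1.0)   # pure ridge shrinkage
mixed = elastic_net_prox(3.0, lam=1.0, alpha=0.5)
```

For $\alpha = 0$ the operator reduces to the soft thresholding associated with the LASSO, and for $\alpha = 1$ to the uniform shrinkage of ridge regression; intermediate values interpolate between the two behaviors.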

In [23], the LASSO task is modified by replacing the squared error term with one involving
correlations, and the minimization task becomes

$$\hat{\theta}: \quad \min_{\theta \in \mathbb{R}^l} \|\theta\|_1, \quad \text{s.t.} \quad \|X^T (y - X\theta)\|_\infty \le \epsilon,$$

where $\epsilon$ is related to $l$ and the noise variance. This task is known as the Dantzig selector. That is, instead
of constraining the energy of the error, the constraint now imposes an upper limit to the correlation of
the error vector with any of the columns of $X$. In [5,15], it is shown that under certain conditions, the
LASSO estimator and the Dantzig selector become identical.

Total variation (TV) [107] is a sparsity promoting notion, closely related to the $\ell_1$ norm, that has been widely
used in image processing. Most grayscale image arrays, $I \in \mathbb{R}^{l \times l}$, consist of slowly varying
pixel intensities except at the edges. As a consequence, the discrete gradient of an image array will be
approximately sparse (compressible). The discrete directional derivatives of an image array are defined
pixel-wise as

$$\nabla_x(I)(i, j) := I(i+1, j) - I(i, j), \quad \forall i \in \{1, 2, \ldots, l-1\}, \tag{10.17}$$

$$\nabla_y(I)(i, j) := I(i, j+1) - I(i, j), \quad \forall j \in \{1, 2, \ldots, l-1\}, \tag{10.18}$$

$$\nabla_x(I)(l, j) := \nabla_y(I)(i, l) := 0, \quad \forall i, j \in \{1, 2, \ldots, l\}. \tag{10.19}$$

The discrete gradient transform

$$\nabla : \mathbb{R}^{l \times l} \longrightarrow \mathbb{R}^{l \times 2l}$$

is defined in terms of a matrix form as

$$\nabla(I)(i, j) := [\nabla_x(I)(i, j), \nabla_y(I)(i, j)], \quad \forall i, j \in \{1, 2, \ldots, l\}. \tag{10.20}$$

The total variation of the image array is defined as the $\ell_1$ norm of the magnitudes of the elements of
the discrete gradient transform, that is,

$$\|I\|_{TV} := \sum_{i=1}^{l} \sum_{j=1}^{l} \|\nabla(I)(i, j)\|_2 = \sum_{i=1}^{l} \sum_{j=1}^{l} \sqrt{(\nabla_x(I)(i, j))^2 + (\nabla_y(I)(i, j))^2}. \tag{10.21}$$

Note that this is a mixture of $\ell_2$ and $\ell_1$ norms. The sparsity promoting optimization around the total
variation is defined as

$$I_* \in \arg\min_{I} \|I\|_{TV}, \quad \text{s.t.} \quad \|y - \mathcal{F}(I)\|_2 \le \epsilon, \tag{10.22}$$

where $y \in \mathbb{R}^N$ is the observations vector and $\mathcal{F}(I)$ denotes the result in vectorized form of the
application of a linear operator on $I$. For example, this could be the result of the action of a partial
two-dimensional DFT on the image. Subsampling of the DFT matrix as a means of forming sensing
matrices has already been discussed in Section 9.7.2. The task in (10.22) retains its convex nature and
it basically expresses our desire to reconstruct an image that is as smooth as possible, given the
available observations. The NESTA algorithm can be used for solving the total variation minimization task;
besides NESTA, other efficient algorithms for this task can be found in, for example, [63,120].
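The definitions of the discrete gradient and of the total variation translate almost line by line into NumPy. The following is a minimal sketch with hypothetical function names; rows are indexed by $i$ and columns by $j$, and the derivatives on the last row and last column are set to zero as in the boundary convention above.

```python
import numpy as np


def discrete_gradient(I):
    """Directional differences of a 2D array with zero boundary values."""
    I = np.asarray(I, dtype=float)
    gx = np.zeros_like(I)
    gy = np.zeros_like(I)
    gx[:-1, :] = I[1:, :] - I[:-1, :]    # I(i+1, j) - I(i, j); last row stays 0
    gy[:, :-1] = I[:, 1:] - I[:, :-1]    # I(i, j+1) - I(i, j); last column stays 0
    return gx, gy


def total_variation(I):
    """Sum over all pixels of the Euclidean norm of the gradient pair."""
    gx, gy = discrete_gradient(I)
    return float(np.sum(np.sqrt(gx ** 2 + gy ** 2)))


step = np.zeros((4, 4))
step[:, 2:] = 1.0                        # piecewise constant image with one edge
tv_step = total_variation(step)
tv_flat = total_variation(np.ones((4, 4)))
```

For the piecewise constant test image, only the four pixels adjacent to the edge contribute, each with a unit jump, whereas a constant image has zero total variation. This is the sense in which images with large constant regions are sparse in the gradient domain.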

It has been shown in [22] for the exact measurements case ($\epsilon = 0$), and in [94] for the erroneous
measurements case, that conditions and bounds that guarantee recovery of an image array from the task
in (10.22) can be derived and are very similar to those we have discussed for the case of the $\ell_1$ norm.

Example 10.1. (Magnetic resonance imaging (MRI)). In contrast to ordinary imaging systems, which
directly acquire pixel samples, MRI scanners sense the image in an encoded form. Specifically, MRI
scanners sample components in the spatial frequency domain, known as “$k$-space” in the MRI
nomenclature. If all the components in this transform domain were available, one could apply the inverse
2D-DFT to recover the exact MR image in the pixel domain. Sampling in the $k$-space is realized along
particular trajectories in a number of successive acquisitions. This process is time consuming, merely
due to physical constraints. As a result, techniques for efficient image recovery from a limited
number of observations are of high importance, because they can reduce the required acquisition time for
performing the measurements. Long acquisition times are not only inconvenient but even impossible,
because the patients have to stay still for long time intervals. Thus, MRI was among the very first
applications where compressed sensing found its way to offering its elegant solutions.

FIGURE 10.9

(A) The original Shepp-Logan image phantom. (B) The white lines indicate the directions across which the
sampling in the spatial Fourier transform was obtained. (C) The recovered image after applying the inverse DFT,
having first filled with zeros the missing values in the DFT transform. (D) The recovered image using the total
variation minimization approach.

Fig. 10.9A shows the “famous” Shepp-Logan phantom, and the goal is to recover it via a limited
number of samples (measurements) in its frequency domain. The MRI measurements are taken
across 17 radial lines in the spatial frequency domain, as shown in Fig. 10.9B. A “naive” approach to
recovering the image from this limited number of samples would be to adopt a zero-filling rationale
for the missing components. The recovered image according to this technique is shown in Fig. 10.9C.
Fig. 10.9D shows the recovered image using the approach of minimizing the total variation, as
explained before. Observe that the results for this case are astonishingly good. The original image is
almost perfectly recovered. The constrained minimization was performed via the NESTA algorithm.
Note that if the minimization of the $\ell_1$ norm of the image array were used in place of the total variation,
the results would not be as good; the phantom image is sparse in the discrete gradient domain, because
it contains large sections that share constant intensities.
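The “naive” zero-filling baseline is easy to reproduce in a few lines, which may help to appreciate what the total variation approach has to improve upon. The sketch below is not the experiment of the example: the small test image, the sampling mask (every other row of the 2D DFT instead of radial lines), and the function name are assumptions chosen so that the code stays self-contained.

```python
import numpy as np


def zero_filled_reconstruction(image, mask):
    """Keep the 2D DFT coefficients where mask is True, zero the rest, and invert."""
    spectrum = np.fft.fft2(image)
    observed = np.where(mask, spectrum, 0.0)     # missing coefficients set to zero
    return np.real(np.fft.ifft2(observed))


image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0                            # toy piecewise constant "phantom"

mask = np.zeros(image.shape, dtype=bool)
mask[::2, :] = True                              # hypothetical sampling pattern
naive = zero_filled_reconstruction(image, mask)
```

When all the coefficients are observed, the reconstruction coincides with the original image; with a partial mask, the discarded frequencies show up as aliasing artifacts, which is the kind of degradation reported for the zero-filling approach in the example.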


10.4 ONLINE SPARSITY PROMOTING ALGORITHMS 

In this section, online schemes for sparsity-aware learning are presented. There are a number of reasons
why one has to resort to such schemes. As has already been noted in previous chapters, in various signal
processing tasks the data arrive sequentially. Under such a scenario, using batch processing techniques
to obtain an estimate of an unknown target parameter vector would be highly inefficient, because the
number of training points keeps increasing. Such an approach is prohibitive for real-time applications.
Moreover, time-recursive schemes can easily incorporate the notion of adaptivity, when the learning
environment is not stationary but undergoes changes as time evolves. Besides signal processing
applications, there is an increasing number of machine learning applications where online processing is of
paramount importance, such as bioinformatics, hyperspectral imaging, and data mining. In such
applications, the number of training points easily amounts to a few thousand up to hundreds of thousands of
points. Concerning the dimensionality of the ambient (feature) space, one can claim numbers that lie
in similar ranges. For example, in [82], the task is to search for sparse solutions in feature spaces with
dimensionality as high as $10^9$ having access to data sets as large as $10^7$ points. Using batch techniques
on a single computer is out of the question with today’s technology.

The setting that we have adopted for this section is the same as that used in previous chapters (e.g.,
Chapters 5 and 6). We assume that there is an unknown parameter vector that generates data according
to the standard regression model

$$y_n = x_n^T \theta + \eta_n, \quad \forall n,$$

and the training samples are received sequentially, $(y_n, x_n)$, $n = 1, 2, \ldots$. In the case of a stationary
environment, we would expect our algorithm to converge asymptotically, as $n \to \infty$, to or “nearly
to” the true parameter vector that gives birth to the observations, $y_n$, when it is sensed by $x_n$. For
time-varying environments, the algorithms should be able to track the underlying changes as time goes
by. Before we proceed, a comment is important. Because the time index, $n$, is left to grow, all we have
said in the previous sections with respect to underdetermined systems of equations loses its meaning.
Sooner or later we are going to have more observations than the dimension of the space in which
the data reside. Our major concern here becomes the issue of asymptotic convergence for the case of
stationary environments. The obvious question that is now raised is, why not use a standard algorithm
(e.g., LMS, RLS, or APSM), as we know that these algorithms converge to, or nearly enough in some
sense, the solution (i.e., the algorithm will identify the zeros asymptotically)? The answer is that if
such algorithms are modified to be aware of the underlying sparsity, convergence is significantly sped
up; in real-life applications, one does not have the “luxury” of waiting a long time for the solution. In
practice, a good algorithm should be able to provide a good enough solution, and in the case of sparse
solutions to obtain the support, after a reasonably small number of iteration steps.

In Chapter 5, we commented on attempts to modify classical online schemes (for example, the
proportionate LMS) in order to consider sparsity. However, these algorithms were of a rather ad hoc
nature. In this section, the powerful theory around the $\ell_1$ norm regularization will be used to obtain
sparsity promoting time-adaptive schemes.

10.4.1 LASSO: ASYMPTOTIC PERFORMANCE 

When presenting the basic principles of parameter estimation in Chapter 3, the notions of bias, variance,
and consistency, which are related to the performance of an estimator, were introduced. In a number
of cases, such performance measures were derived asymptotically. For example, we have seen that
the maximum likelihood estimator is asymptotically unbiased and consistent. In Chapter 6, we saw
that the LS estimator is also asymptotically consistent. Moreover, under the assumption that the noise
samples are i.i.d., the LS estimator, $\hat{\theta}_n$, based on $n$ observations, is itself a random vector that satisfies
the $\sqrt{n}$-estimation consistency, that is,

$$\sqrt{n}\,(\hat{\theta}_n - \theta_o) \longrightarrow \mathcal{N}(0, \sigma^2 \Sigma^{-1}),$$

where $\theta_o$ is the true vector that generates the observations, $\sigma^2$ denotes the variance of the noise source,
$\Sigma$ is the covariance matrix $E[x x^T]$ of the input sequence, which has been assumed to be zero mean,
and the limit denotes convergence in distribution.

The LASSO in its (9.6) formulation is the task of minimizing the $\ell_1$ norm regularized version of
the LS cost. However, nothing has been said so far about the statistical properties of this estimator. The
only performance measure that we referred to was the error norm bound given in (9.36). However, this
bound, although important in the context for which it was proposed, does not provide much
statistical information. Since the introduction of the LASSO estimator, a number of papers have addressed
problems related to its statistical performance (see, e.g., [45,55,74,127]).

When dealing with sparsity promoting estimators such as the LASSO, two crucial issues emerge:
(a) whether the estimator, even asymptotically, can obtain the support, if the true vector parameter is
a sparse one; and (b) to quantify the performance of the estimator with respect to the estimates of
the nonzero coefficients, that is, those coefficients whose index belongs to the support. For LASSO
in particular, the latter issue amounts to studying whether LASSO behaves as well as the unregularized LS
with respect to these nonzero components. This task was addressed for the first time and in a more
general setting in [55]. Let the support of the true, yet unknown, $k$-sparse parameter vector $\theta_o$ be
denoted as $S$. Let also $\Sigma_{|S}$ be the $k \times k$ covariance matrix $E[x_{|S} x_{|S}^T]$, where $x_{|S} \in \mathbb{R}^k$ is the random
vector that contains only the $k$ components of $x$, with indices in the support $S$. Then we say that an
estimator satisfies asymptotically the oracle properties if:

• $\lim_{n \to \infty} \text{Prob}\{S_{\hat{\theta}_n} = S\} = 1$; this is known as support consistency,

• $\sqrt{n}\,(\hat{\theta}_{n|S} - \theta_{o|S}) \longrightarrow \mathcal{N}(0, \sigma^2 \Sigma_{|S}^{-1})$; this is the $\sqrt{n}$-estimation consistency.

We denote as $\theta_{o|S}$ and $\hat{\theta}_{n|S}$ the $k$-dimensional vectors that result from $\theta_o$, $\hat{\theta}_n$, respectively, if we keep
the components whose indices lie in the support $S$, and the limit is meant in distribution. In other
words, according to the oracle properties, a good sparsity promoting estimator should be able to predict,
asymptotically, the true support and its performance with respect to the nonzero components should
be as good as that of a genie-aided LS estimator, which is informed in advance of the positions of the
nonzero coefficients.

Unfortunately, the LASSO estimator cannot satisfy simultaneously both conditions. It has been
shown [55,74,127] that:

• For support consistency, the regularization parameter $\lambda$ should be time-varying, such that

$$\lim_{n \to \infty} \frac{\lambda_n}{\sqrt{n}} = \infty, \qquad \lim_{n \to \infty} \frac{\lambda_n}{n} = 0.$$

That is, $\lambda_n$ must grow faster than $\sqrt{n}$, but slower than $n$.
• For $\sqrt{n}$-consistency, $\lambda_n$ must grow as

$$\lim_{n \to \infty} \frac{\lambda_n}{\sqrt{n}} = 0,$$

that is, it grows slower than $\sqrt{n}$.

The previous two conditions are conflicting and the LASSO estimator cannot comply with the two
oracle conditions simultaneously. The proofs of the previous two points are somewhat technical and are
not given here. The interested reader can obtain them from the previously given references. However,
before we proceed, it is instructive to see why the regularization parameter has to grow more slowly
than $n$, in any case. Without being too rigorous mathematically, recall that the LASSO solution comes
from Eq. (9.6). This can be written as

$$0 \in -\frac{2}{n} \sum_{j=1}^{n} x_j y_j + \frac{2}{n} \left( \sum_{j=1}^{n} x_j x_j^T \right) \theta + \frac{\lambda_n}{n}\, \partial \|\theta\|_1, \tag{10.23}$$

where $\partial \|\theta\|_1$ denotes the subdifferential of the $\ell_1$ norm and where we have divided both sides by $n$.
Taking the limit as $n \to \infty$, if $\lambda_n / n \to 0$, then we are
left with the first two terms; this is exactly what we would have if the unregularized sum of squared
errors had been chosen as the cost function. Recall from Chapter 6 that in this case, the solution
asymptotically converges$^1$ (under some general assumptions, which are assumed to hold true here)
to the true parameter vector; that is, we have strong consistency.
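To make the growth conditions on the regularization parameter concrete, consider power-law sequences; the specific exponents below are chosen here only for illustration.

```latex
% A choice that satisfies the support consistency conditions:
\lambda_n = n^{3/4}: \qquad
\frac{\lambda_n}{\sqrt{n}} = n^{1/4} \longrightarrow \infty, \qquad
\frac{\lambda_n}{n} = n^{-1/4} \longrightarrow 0.
% A choice that satisfies the \sqrt{n}-consistency condition:
\lambda_n = n^{1/4}: \qquad
\frac{\lambda_n}{\sqrt{n}} = n^{-1/4} \longrightarrow 0.
% In general, for \lambda_n = n^{c}: the first set of conditions requires 1/2 < c < 1,
% whereas the second requires c < 1/2, so no exponent c serves both.
```

The same conflict holds for any sequence, not only for power laws, because the ratio $\lambda_n / \sqrt{n}$ cannot tend to infinity and to zero at the same time.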


10.4.2 THE ADAPTIVE NORM-WEIGHTED LASSO 


There are two ways to get out of the previously stated conflict. One is to replace the $\ell_1$ norm with a
nonconvex function, which can lead to an estimator that satisfies the oracle properties simultaneously
[55]. The other is to modify the $\ell_1$ norm by replacing it with a weighted version. Recall that the
weighted $\ell_1$ norm was discussed in Section 10.3 as a means to assist the optimization procedure to
unveil the sparse solution. Here the notion of weighted $\ell_1$ norm comes as a necessity imposed by our
desire to satisfy the oracle properties. This gives rise to the adaptive time-and-norm-weighted LASSO
(TNWL) cost estimate, defined as

$$\hat{\theta}_n = \arg\min_{\theta \in \mathbb{R}^l} \left\{ \sum_{j=1}^{n} \beta^{n-j} (y_j - x_j^T \theta)^2 + \lambda_n \sum_{i=1}^{l} w_{i,n} |\theta_i| \right\}, \tag{10.24}$$

where $\beta \le 1$ is used as the forgetting factor to allow for tracking slow variations. The time-varying
weighting sequence is denoted as $w_{i,n}$. There are different options. In [127] and under a stationary
environment with $\beta = 1$, it is shown that if

$$w_{i,n} = \frac{1}{|\theta_i^{\text{est}}|^{\gamma}},$$

where $\theta_i^{\text{est}}$ is the estimate of the $i$th component obtained by any $\sqrt{n}$-consistent estimator, such as the
unregularized LS, then for specific choices of $\lambda_n$ and $\gamma$ the corresponding estimator satisfies the oracle
properties simultaneously. The main reasoning behind the weighted $\ell_1$ norm is that as time goes by, and
the $\sqrt{n}$-consistent estimator provides better and better estimates, the weights corresponding to indices
outside the true support (zero values) are inflated and those corresponding to the true support converge
to a finite value. This helps the algorithm simultaneously to locate the support and obtain unbiased
(asymptotically) estimates of the large coefficients.

$^1$ Recall that this convergence is with probability 1.

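Another choice for the weighting sequence is related to the smoothly clipped absolute deviation
(SCAD) [55,128]. This is defined as

$$w_{i,n} = \chi_{(0,\mu_n)}(|\theta_i^{\text{est}}|) + \frac{\left( a\mu_n - |\theta_i^{\text{est}}| \right)_+}{(a-1)\mu_n}\, \chi_{(\mu_n,\infty)}(|\theta_i^{\text{est}}|),$$

where $\chi(\cdot)$ stands for the characteristic function, $\mu_n = \lambda_n / n$, and $a > 2$. Basically, this corresponds to
a quadratic spline function. It turns out [128] that if $\lambda_n$ is chosen to grow faster than $\sqrt{n}$ and slower
than $n$, the adaptive LASSO with $\beta = 1$ satisfies both oracle conditions simultaneously.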
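The two weighting options can be compared with a short sketch. The function names are hypothetical; the default $a = 3.7$ is the value commonly suggested in the SCAD literature and is an assumption here, as is the treatment of the boundary cases (an estimate that is exactly zero, or exactly equal to $\mu_n$), which the characteristic functions leave open.

```python
def adaptive_lasso_weight(theta_est, gamma=1.0):
    """Weight 1/|theta_est|^gamma; an exactly zero estimate gets an infinite weight."""
    t = abs(theta_est)
    return float("inf") if t == 0.0 else 1.0 / t ** gamma


def scad_weight(theta_est, mu, a=3.7):
    """SCAD-type weight: 1 for small estimates, decaying linearly to 0 at a*mu."""
    t = abs(theta_est)
    if t <= mu:
        return 1.0
    return max(a * mu - t, 0.0) / ((a - 1.0) * mu)


small = scad_weight(0.5, mu=1.0)     # inside (0, mu): full l1 weight
medium = scad_weight(2.0, mu=1.0)    # between mu and a*mu: reduced weight
large = scad_weight(5.0, mu=1.0)     # beyond a*mu: no penalty at all
```

Both rules penalize small estimates heavily and large ones lightly; the SCAD-type rule goes one step further and removes the penalty altogether for estimates larger than $a\mu_n$, which is what allows the large coefficients to be estimated without shrinkage.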

A time-adaptive scheme for solving the TNWL LASSO was presented in [3]. The cost function of 
the adaptive LASSO in (10.24) can be written as 


J(0) = 6 T R n 0-rf 1 0 + X n \\O\\ hWl 


where 


n 


n 



and \\6 1| i w is the weighted t\ norm. We know from Chapter 6, and it is straightforward to see, that 


Rn — fiR-n— 1 T X n X u , V n — /lr n —\ -\- y n x n . 


The complexity for both of the previous updates, for matrices of a general structure, amounts to $O(l^2)$
multiplication/addition operations. One alternative is to update $R_n$ and $\boldsymbol{r}_n$ and then solve a convex
optimization task for each time instant, $n$, using any standard algorithm. However, this is not appropriate
for real-time applications, due to its excessive computational cost. In [3], a time-recursive version of
a coordinate descent algorithm has been developed. As we saw in Section 10.2.2, coordinate descent
algorithms update one component at each iteration step. In [3], iteration steps are associated with time
updates, as is always the case with online algorithms. As each new training pair $(y_n, \boldsymbol{x}_n)$ is
received, a single component of the unknown vector is updated. Hence, at each time instant, a scalar
optimization task has to be solved, and its solution is given in closed form, which results in a simple
soft thresholding operation. One of the drawbacks of the coordinate techniques is that each coefficient
is updated only once every $l$ time instants, which, for large values of $l$, can slow down convergence. Variants of
the basic scheme that cope with this drawback are also addressed in [3], referred to as online cyclic
coordinate descent time-weighted LASSO (OCCD-TWL). The complexity of the scheme is of the order
of $O(l^2)$. Computational savings are possible if the input sequence is a time series, because fast schemes
for the updates of $R_n$ and the RLS can then be exploited. However, if an RLS-type algorithm is used in
parallel, the convergence of the overall scheme may be slowed down, because the RLS-type algorithm
has to converge first in order to provide reliable estimates for the weights, as pointed out before.
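The time-recursive coordinate descent idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact scheme of [3]: the class name `OnlineCoordinateDescent` and its arguments are hypothetical, the cost is taken to be $\frac{1}{2}\sum_j \beta^{n-j}(y_j - \boldsymbol{x}_j^T\boldsymbol{\theta})^2 + \lambda\sum_i w_i|\theta_i|$, and the stored matrix `A` is the unscaled correlation sum $\sum_j \beta^{n-j}\boldsymbol{x}_j\boldsymbol{x}_j^T$ (a constant scaling can always be absorbed into $\lambda$). At each time instant the correlations are updated at $O(l^2)$ cost and a single coordinate is refreshed by a closed-form soft thresholding.

```python
def soft_threshold(z, t):
    """Soft thresholding operator: shrink z toward zero by the amount t >= 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


class OnlineCoordinateDescent:
    """Illustrative sketch: one coordinate of theta is updated per time instant."""

    def __init__(self, l, beta=1.0, lam=0.1):
        self.l = l
        self.beta = beta                          # forgetting factor
        self.lam = lam                            # regularization parameter
        self.A = [[0.0] * l for _ in range(l)]    # sum_j beta^(n-j) x_j x_j^T
        self.r = [0.0] * l                        # sum_j beta^(n-j) y_j x_j
        self.theta = [0.0] * l
        self.n = 0                                # time index

    def update(self, x, y, weights=None):
        l = self.l
        w = weights if weights is not None else [1.0] * l
        # Rank-one, exponentially weighted updates: O(l^2) operations.
        for i in range(l):
            self.r[i] = self.beta * self.r[i] + y * x[i]
            for j in range(l):
                self.A[i][j] = self.beta * self.A[i][j] + x[i] * x[j]
        # Cyclic choice of the single coordinate updated at this time instant.
        p = self.n % l
        self.n += 1
        if self.A[p][p] > 0.0:
            # Scalar minimization over theta_p with all other coordinates fixed.
            corr = self.r[p] - sum(self.A[p][q] * self.theta[q]
                                   for q in range(l) if q != p)
            self.theta[p] = soft_threshold(corr, self.lam * w[p]) / self.A[p][p]
        return self.theta
```

Passing time-varying `weights` (for example, the inverse magnitudes of an auxiliary estimate) turns the plain $\ell_1$ penalty into a weighted one; how those weights are produced is left outside this sketch.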
10.4.3 ADAPTIVE CoSaMP ALGORITHM

In [90], an adaptive version of the CoSaMP algorithm, whose steps are summarized in Algorithm 10.2, 
was proposed. Iteration steps i now coincide with time updates n, and the LS solver in step 3c of the 
general CSMP scheme is replaced by an LMS one. 

Let us focus first on the quantity $X^T \boldsymbol{e}^{(i-1)}$, in step 3a of the CSMP scheme, which is used to
compute the support at iteration $i$. In the online setting and at (iteration) time $n$, this quantity is now
“rephrased” as

$$X^T \boldsymbol{e}_{n-1} = \sum_{j=1}^{n-1} \boldsymbol{x}_j e_j.$$

In order to make the algorithm flexible to adapt to variations of the environment as the time index
$n$ increases, the previous correlation sum is modified to

$$\boldsymbol{p}_n := \sum_{j=1}^{n-1} \beta^{n-1-j} \boldsymbol{x}_j e_j = \beta \boldsymbol{p}_{n-1} + \boldsymbol{x}_{n-1} e_{n-1}.$$

The LS task, constrained on the active columns that correspond to the indices in the support $S$ in
step 3c, is performed in an online rationale by involving the basic LMS recursions, that is (see footnote 2),

$$e_n := y_n - \boldsymbol{x}_{n|S}^T \boldsymbol{\theta}_{|S}(n-1),$$

$$\boldsymbol{\theta}_{|S}(n) := \boldsymbol{\theta}_{|S}(n-1) + \mu \boldsymbol{x}_{n|S} e_n,$$

where $\boldsymbol{\theta}_{|S}(\cdot)$ and $\boldsymbol{x}_{n|S}$ denote the respective subvectors corresponding to the indices in the support $S$.
The resulting algorithm is summarized as follows.

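Algorithm 10.4 (The AdCoSaMP scheme).

1. Select the value of $t = 2k$.

2. Initialize the algorithm: $\boldsymbol{\theta}(1) = \boldsymbol{0}$, $\tilde{\boldsymbol{\theta}}(1) = \boldsymbol{0}$, $\boldsymbol{p}_1 = \boldsymbol{0}$, $e_1 = y_1$.

3. Choose $\mu$ and $\beta$.

4. For $n = 2, 3, \ldots$, execute the following steps:

(a) $\boldsymbol{p}_n = \beta \boldsymbol{p}_{n-1} + \boldsymbol{x}_{n-1} e_{n-1}$.

(b) Obtain the current support:

$$S = \operatorname{supp}\{\boldsymbol{\theta}(n-1)\} \cup \{\text{indices of the } t \text{ largest in magnitude components of } \boldsymbol{p}_n\}.$$

(c) Perform the LMS update:

$$e_n = y_n - \boldsymbol{x}_{n|S}^T \tilde{\boldsymbol{\theta}}_{|S}(n-1),$$
$$\tilde{\boldsymbol{\theta}}_{|S}(n) = \tilde{\boldsymbol{\theta}}_{|S}(n-1) + \mu \boldsymbol{x}_{n|S} e_n.$$

(d) Obtain the set $S_k$ of the indices of the $k$ largest (in magnitude) components of $\tilde{\boldsymbol{\theta}}_{|S}(n)$.

(e) Obtain $\boldsymbol{\theta}(n)$ such that

$$\boldsymbol{\theta}_{|S_k}(n) = \tilde{\boldsymbol{\theta}}_{|S_k}(n) \quad \text{and} \quad \boldsymbol{\theta}_{|S_k^c}(n) = \boldsymbol{0},$$

where $S_k^c$ is the complement set of $S_k$.

(f) Update the error: $e_n = y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}(n)$.

2 The time index for the parameter vector is given in parentheses, due to the presence of the other subscripts.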

In place of the standard LMS, its normalized version can alternatively be adopted. Note that step 4e is
directly related to the hard thresholding operation.

In [90], it is shown that if the sensing matrix, which is now time dependent and keeps increasing in
size, satisfies a condition similar to RIP for each time instant, called the exponentially weighted isometry
property (ERIP), which depends on $\beta$, then the algorithm asymptotically satisfies an error bound, which
is similar to the one that has been derived for CoSaMP in [93], plus an extra term that is due to the
excess mean-square error (see Chapter 5), which is the price paid for replacing the LS solver by the
LMS.
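The steps of the AdCoSaMP scheme can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the implementation of [90]: the function names `adcosamp` and `make_toy_data` are hypothetical, the auxiliary LMS estimate `theta_aux` keeps its previous values on indices outside the current support, and the toy data generator is there only to make the example self-contained.

```python
import math


def adcosamp(X, y, k, mu, beta):
    """Sketch of an adaptive CoSaMP-type recursion.

    X: list of input vectors x_1, ..., x_N; y: list of outputs y_1, ..., y_N;
    k: assumed sparsity level; mu: LMS step size; beta: forgetting factor.
    """
    l = len(X[0])
    t = 2 * k
    theta = [0.0] * l        # k-sparse estimate, theta(1) = 0
    theta_aux = [0.0] * l    # auxiliary LMS estimate (assumption: kept between steps)
    p = [0.0] * l            # exponentially weighted correlation vector, p_1 = 0
    e_prev = y[0]            # e_1 = y_1
    for n in range(1, len(X)):
        x_prev, x = X[n - 1], X[n]
        # (a) update the correlation proxy with the previous input and error
        p = [beta * pi + xi * e_prev for pi, xi in zip(p, x_prev)]
        # (b) current support: old support plus the t largest entries of p
        largest = sorted(range(l), key=lambda i: abs(p[i]), reverse=True)[:t]
        support = sorted({i for i in range(l) if theta[i] != 0.0} | set(largest))
        # (c) LMS update restricted to the support
        err = y[n] - sum(x[i] * theta_aux[i] for i in support)
        for i in support:
            theta_aux[i] += mu * x[i] * err
        # (d) keep the k largest (in magnitude) components on the support
        s_k = sorted(support, key=lambda i: abs(theta_aux[i]), reverse=True)[:k]
        # (e) hard thresholding: zero everywhere outside S_k
        theta = [0.0] * l
        for i in s_k:
            theta[i] = theta_aux[i]
        # (f) error of the pruned estimate, used at the next time instant
        e_prev = y[n] - sum(xi * ti for xi, ti in zip(x, theta))
    return theta


def make_toy_data(l=8, n_samples=50):
    """Deterministic toy regression data with a 2-sparse parameter vector."""
    theta_true = [0.0] * l
    theta_true[1], theta_true[5] = 1.0, -2.0
    X = [[math.sin(1.3 * n + 0.7 * i * i + 0.1) for i in range(l)]
         for n in range(n_samples)]
    y = [sum(a * b for a, b in zip(x, theta_true)) for x in X]
    return X, y
```

By construction the returned estimate has at most $k$ nonzero entries; how close it gets to the true vector depends on $\mu$, $\beta$, and the data, which this sketch does not attempt to tune.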


10.4.4 SPARSE-ADAPTIVE PROJECTION SUBGRADIENT METHOD


The APSM family of algorithms was introduced in Chapter 8, as one among the most popular
techniques for online/adaptive learning. As pointed out there, a major advantage of this algorithmic family
is that one can readily incorporate convex constraints. In Chapter 8, APSM was used as an
alternative to methods that build around the sum of squared errors cost function, such as the LMS and the
RLS. The rationale behind APSM is that because our data are assumed to be generated by a regression
model, the unknown vector could be estimated by finding a point in the intersection of a sequence of
hyperslabs that are defined by the data points, that is, $S_n[\epsilon] := \{\boldsymbol{\theta} \in \mathbb{R}^l : |y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}| \le \epsilon\}$. Also, it was
pointed out that such a model is very natural when the noise is bounded. When dealing with sparse
vectors, there is an additional constraint that we want our solution to satisfy, that is, $\|\boldsymbol{\theta}\|_1 \le \rho$ (see also
the LASSO formulation (9.7)). This task fits nicely in the APSM rationale and the basic recursion can
be readily written, without much thought or derivation, as follows. For any arbitrarily chosen initial
point $\boldsymbol{\theta}_0$, define $\forall n$,

$$\boldsymbol{\theta}_n = P_{B_{\ell_1}[\rho]}\left(\boldsymbol{\theta}_{n-1} + \mu_n\left(\sum_{i=n-q+1}^{n} \omega_i^{(n)} P_{S_i[\epsilon]}(\boldsymbol{\theta}_{n-1}) - \boldsymbol{\theta}_{n-1}\right)\right), \qquad (10.25)$$


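where $q \ge 1$ is the number of hyperslabs that are considered each time and $\mu_n$ is a user-defined
extrapolation parameter. In order for convergence to be guaranteed, theory dictates that it must lie in the
interval $(0, 2\mathcal{M}_n)$, where

$$\mathcal{M}_n := \begin{cases} \dfrac{\sum_{i=n-q+1}^{n} \omega_i^{(n)} \left\|P_{S_i[\epsilon]}(\boldsymbol{\theta}_{n-1}) - \boldsymbol{\theta}_{n-1}\right\|^2}{\left\|\sum_{i=n-q+1}^{n} \omega_i^{(n)} P_{S_i[\epsilon]}(\boldsymbol{\theta}_{n-1}) - \boldsymbol{\theta}_{n-1}\right\|^2}, & \text{if } \left\|\sum_{i=n-q+1}^{n} \omega_i^{(n)} P_{S_i[\epsilon]}(\boldsymbol{\theta}_{n-1}) - \boldsymbol{\theta}_{n-1}\right\| \ne 0, \\[2ex] 1, & \text{otherwise,} \end{cases} \qquad (10.26)$$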


and $P_{B_{\ell_1}[\rho]}(\cdot)$ is the projection operator onto the $\ell_1$ ball $B_{\ell_1}[\rho] := \{\boldsymbol{\theta} \in \mathbb{R}^l : \|\boldsymbol{\theta}\|_1 \le \rho\}$, because the
solution is constrained to lie within this ball. Note that recursion (10.25) is analogous to the iterative
soft thresholding shrinkage algorithm in the batch processing case (10.7). There, we saw that the only
difference the sparsity imposes on an iteration, with respect to its unconstrained counterpart, is an
extra soft thresholding. This is exactly the case here. The term in parentheses is the iteration for the
unconstrained task. Moreover, as shown in [46], projection on the $\ell_1$ ball is equivalent to a soft
thresholding operation. Following the general arguments given in Chapter 8, the previous iteration converges
arbitrarily close to a point in the intersection


$$B_{\ell_1}[\rho] \cap \bigcap_{n \ge n_0} S_n[\epsilon],$$

for some finite value of $n_0$.

FIGURE 10.10

Geometric illustration of the update steps involved in the SpAPSM algorithm, for the case of $q = 2$. The update at
time $n$ is obtained by first convexly combining the projections onto the current and previously formed hyperslabs,
$S_n[\epsilon]$, $S_{n-1}[\epsilon]$, and then projecting onto the weighted $\ell_1$ ball. This brings the update closer to the target solution $\boldsymbol{\theta}_o$.

In [76,77], the weighted $\ell_1$ ball (denoted here as $B_{\ell_1}[\boldsymbol{w}_n, \rho]$) has been
used to improve convergence as well as the tracking speed of the algorithm, when the environment is
time-varying. The weights were adopted in accordance with what was discussed in Section 10.3, that
is,

$$w_{i,n} := \frac{1}{|\theta_{i,n-1}| + \acute{\epsilon}_n}, \quad \forall i \in \{1, 2, \ldots, l\},$$

where $(\acute{\epsilon}_n)_{n \ge 0}$ is a sequence (can be also constant) of small numbers to avoid division by zero. The
basic time iteration becomes as follows. For any arbitrarily chosen initial point $\boldsymbol{\theta}_0$, define $\forall n$,

$$\boldsymbol{\theta}_n = P_{B_{\ell_1}[\boldsymbol{w}_n, \rho]}\left(\boldsymbol{\theta}_{n-1} + \mu_n\left(\sum_{i=n-q+1}^{n} \omega_i^{(n)} P_{S_i[\epsilon]}(\boldsymbol{\theta}_{n-1}) - \boldsymbol{\theta}_{n-1}\right)\right), \qquad (10.27)$$


where $\mu_n \in (0, 2\mathcal{M}_n)$ and $\mathcal{M}_n$ is given in (10.26). Fig. 10.10 illustrates the associated geometry of the
basic iteration in $\mathbb{R}^2$ and for the case of $q = 2$. It comprises two parallel projections on the hyperslabs,
followed by one projection onto the weighted $\ell_1$ ball. In [76], it is shown (Problem 10.7) that a good
bound for the weighted $\ell_1$ norm is the sparsity level $k$ of the target vector, which is assumed to be
known and is a user-defined parameter. In [76], it is shown that asymptotically, and under some general
assumptions, this algorithmic scheme converges arbitrarily close to the intersection of the hyperslabs
with the weighted $\ell_1$ balls, that is,

$$\bigcap_{n \ge n_0} \left(B_{\ell_1}[\boldsymbol{w}_n, \rho] \cap S_n[\epsilon]\right),$$

for some nonnegative integer $n_0$. It has to be pointed out that in the case of weighted $\ell_1$ norms, the
constraint is time-varying and the convergence analysis is not covered by the standard analysis used
for APSM, and had to be extended to this more general case.
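The part of the iteration that precedes the projection onto the (weighted) $\ell_1$ ball can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, equal weights $\omega_i^{(n)} = 1/q$ are used, and $\mathcal{M}_n$ is computed with the usual APSM expression, that is, the weighted sum of the squared distances to the individual projections divided by the squared norm of the combined move. The hyperslab projection itself has a simple closed form, because a hyperslab is the region between two parallel hyperplanes.

```python
def project_hyperslab(theta, x, y, eps):
    """Euclidean projection of theta onto {theta : |y - x^T theta| <= eps}."""
    resid = y - sum(xi * ti for xi, ti in zip(x, theta))
    norm2 = sum(xi * xi for xi in x)
    if norm2 == 0.0 or abs(resid) <= eps:
        return list(theta)                    # already inside the hyperslab
    # Move along x until the nearest bounding hyperplane is reached.
    shift = resid - eps if resid > eps else resid + eps
    return [ti + shift * xi / norm2 for ti, xi in zip(theta, x)]


def apsm_inner_step(theta, data, eps, mu_scale=1.0):
    """Extrapolated convex combination of q hyperslab projections.

    data: list of the q most recent pairs (x, y).
    mu_scale: fraction of M_n used as step size; it should lie in (0, 2).
    Returns the updated point and the value of M_n.
    """
    q = len(data)
    dim = len(theta)
    projections = [project_hyperslab(theta, x, y, eps) for x, y in data]
    # Convex combination with equal weights 1/q, minus the current point.
    direction = [sum(P[i] for P in projections) / q - theta[i] for i in range(dim)]
    denom = sum(d * d for d in direction)
    if denom == 0.0:
        m_n = 1.0
    else:
        numer = sum(sum((P[i] - theta[i]) ** 2 for i in range(dim))
                    for P in projections) / q
        m_n = numer / denom
    mu = mu_scale * m_n
    return [ti + mu * di for ti, di in zip(theta, direction)], m_n
```

In a complete sparsity-aware iteration, the point returned by `apsm_inner_step` would subsequently be projected onto the $\ell_1$ ball or its weighted version.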

The complexity of the algorithm amounts to $O(ql)$. The larger $q$, the faster the convergence rate, at
the expense of higher complexity. In [77], in order to reduce the dependence of the complexity on $q$, the
notion of the subdimensional projection is introduced, where projections onto the $q$ hyperslabs could
be restricted along the directions of the most significant parameters of the currently available estimates.
The dependence on $q$ now becomes $O(qk_n)$, where $k_n$ is the sparsity level of the currently available
estimate, which after a few steps of the algorithm gets much lower than $l$. The total complexity amounts
to $O(l) + O(qk_n)$ per iteration step. This allows the use of large values of $q$, which (at only a small
extra computational cost compared to $O(l)$) drives the algorithm to a performance close to that of the
adaptive weighted LASSO.

Projection Onto the Weighted $\ell_1$ Ball

Projecting onto an $\ell_1$ ball is equivalent to a soft thresholding operation. Projection onto the weighted $\ell_1$
ball results in a slight variation of the soft thresholding, with different threshold values per component.
In the sequel, we give the iteration steps for the more general case of the weighted $\ell_1$ ball. The proof
is a bit technical and lengthy and it will not be given here. It was derived, for the first time, via purely
geometric arguments, and without the use of the classical Lagrange multipliers, in [76]. Lagrange
multipliers have been used instead in [46], for the case of the $\ell_1$ ball. The efficient computation of the
projection on the $\ell_1$ ball was treated earlier, in a more general context, in [100].

Recall from the definition of a projection, discussed in Chapter 8, that given a point outside the ball,
$\boldsymbol{\theta} \in \mathbb{R}^l \setminus B_{\ell_1}[\boldsymbol{w}, \rho]$, its projection onto the weighted $\ell_1$ ball is the point $P_{B_{\ell_1}[\boldsymbol{w}, \rho]}(\boldsymbol{\theta}) \in B_{\ell_1}[\boldsymbol{w}, \rho] :=
\{\boldsymbol{z} \in \mathbb{R}^l : \sum_{i=1}^{l} w_i |z_i| \le \rho\}$ that lies closest to $\boldsymbol{\theta}$ in the Euclidean sense. If $\boldsymbol{\theta}$ lies within the ball, then it
coincides with its projection. Given the weights and the value of $\rho$, the following iterations provide the
projection.


Algorithm 10.5 (Projection onto the weighted $\ell_1$ ball $B_{\ell_1}[\boldsymbol{w}, \rho]$).

1. Form the vector $[|\theta_1|/w_1, \ldots, |\theta_l|/w_l]^T \in \mathbb{R}^l$.

2. Sort the previous vector in a nonascending order, so that $|\theta_{\tau(1)}|/w_{\tau(1)} \ge \ldots \ge |\theta_{\tau(l)}|/w_{\tau(l)}$. The
notation $\tau$ stands for the permutation, which is implicitly defined by the sorting operation. Keep in
mind the inverse $\tau^{-1}$, which is the index of the position of the element in the original vector.

3. $r_1 := l$.

4. Let $m = 1$. While $m \le l$, do

(a) $m_* := m$.

(b) Find the maximum $j_*$ among those $j \in \{1, 2, \ldots, r_m\}$ such that

$$\frac{|\theta_{\tau(j)}|}{w_{\tau(j)}} > \frac{\sum_{i=1}^{r_m} w_{\tau(i)} |\theta_{\tau(i)}| - \rho}{\sum_{i=1}^{r_m} w_{\tau(i)}^2}.$$

(c) If $j_* = r_m$ then break the loop.

(d) Otherwise set $r_{m+1} := j_*$.

(e) Increase $m$ by 1 and go back to Step 4a.

5. Form the vector $\hat{\boldsymbol{p}} \in \mathbb{R}^{r_{m_*}}$ whose $j$th component, $j = 1, \ldots, r_{m_*}$, is given by

$$\hat{p}_j := |\theta_{\tau(j)}| - \frac{\sum_{i=1}^{r_{m_*}} w_{\tau(i)} |\theta_{\tau(i)}| - \rho}{\sum_{i=1}^{r_{m_*}} w_{\tau(i)}^2} \, w_{\tau(j)}.$$

6. Use the inverse mapping $\tau^{-1}$ to insert the element $\hat{p}_j$ into the $\tau^{-1}(j)$ position of the $l$-dimensional
vector $\boldsymbol{p}$, $\forall j \in \{1, 2, \ldots, r_{m_*}\}$, and fill in the rest with zeros.

7. The desired projection is $P_{B_{\ell_1}[\boldsymbol{w}, \rho]}(\boldsymbol{\theta}) = [\operatorname{sgn}(\theta_1) p_1, \ldots, \operatorname{sgn}(\theta_l) p_l]^T$.
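The steps of the projection onto the weighted $\ell_1$ ball can be sketched in Python as below. This is a minimal illustration, assuming strictly positive weights and $\rho > 0$; the function name `project_weighted_l1_ball` is hypothetical, and the sorted index list `order` plays the role of the permutation, so that no separate inverse mapping has to be stored.

```python
import math


def project_weighted_l1_ball(theta, w, rho):
    """Euclidean projection of theta onto {z : sum_i w_i |z_i| <= rho}.

    Assumes w_i > 0 for all i and rho > 0.
    """
    l = len(theta)
    # A point inside the ball coincides with its projection.
    if sum(wi * abs(ti) for wi, ti in zip(w, theta)) <= rho:
        return list(theta)
    # Steps 1-2: sort the indices by |theta_i| / w_i in nonascending order.
    order = sorted(range(l), key=lambda i: abs(theta[i]) / w[i], reverse=True)
    r = l                                     # Step 3: all components active
    while True:                               # Step 4: shrink the active set
        active = order[:r]
        level = ((sum(w[i] * abs(theta[i]) for i in active) - rho)
                 / sum(w[i] ** 2 for i in active))
        # Largest position j whose ratio still exceeds the current level.
        j_star = r
        while j_star > 1 and abs(theta[order[j_star - 1]]) / w[order[j_star - 1]] <= level:
            j_star -= 1
        if j_star == r:
            break
        r = j_star
    # Steps 5-7: component-wise thresholding with threshold level * w_i,
    # written back to the original positions with the original signs.
    p = [0.0] * l
    for i in order[:r]:
        p[i] = math.copysign(abs(theta[i]) - level * w[i], theta[i])
    return p
```

With all weights equal to one, the routine reduces to the usual projection onto the $\ell_1$ ball, that is, a soft thresholding with a common threshold for all components.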

Remarks 10.3.

• Generalized thresholding rules: Projections onto both $\ell_1$ and weighted $\ell_1$ balls impose convex
sparsity-inducing constraints via properly performed soft thresholding operations. More recent advances
within the SpAPSM framework [78,109] allow the substitution of $P_{B_{\ell_1}[\rho]}$ and $P_{B_{\ell_1}[\boldsymbol{w}, \rho]}$ with a
generalized thresholding, built around the notions of SCAD, nonnegative garrote, and a number of
thresholding functions corresponding to the nonconvex $\ell_p$, $p < 1$, penalties. Moreover, it is shown
that such generalized thresholding (GT) operators are nonlinear mappings with their fixed point set
being a union of subspaces, that is, the nonconvex object that lies at the heart of any sparsity
promoting technique. Such schemes are very useful for low values of $q$, where one can improve upon
the performance obtained by the LMS-based AdCoSaMP, at comparable complexity levels.

• A comparative study of various online sparsity promoting low-complexity schemes, including the
proportionate LMS, in the context of the echo cancelation task, is given in [80]. It turns out that the
SpAPSM-based schemes outperform LMS-based sparsity promoting algorithms.

• More algorithms and methods that involve sparsity promoting regularization, in the context of more
general convex loss functions, compared to the squared error, are discussed in Chapter 8, where
related references are provided.

• Distributed sparsity promoting algorithms: Besides the algorithms reported so far, a number of
algorithms in the context of distributed learning have also appeared in the literature. As pointed out
in Chapter 5, algorithms complying with the consensus as well as the diffusion rationale have been
proposed (see, e.g., [29,39,89,102]). A review of such algorithms appears in [30].



FIGURE 10.11

MSE learning curves for AdCoSaMP, SpAPSM, and OCCD-TWL for the simulation in Example 10.2. The vertical
axis shows the $\log_{10}$ of the mean-square error, that is, $\log_{10}\left(\frac{1}{2}\|\boldsymbol{s} - \Psi\boldsymbol{\theta}_n\|^2\right)$, and the horizontal axis shows the time index (iterations).
At time $n = 1500$, the system undergoes a sudden change.


Example 10.2 (Time-varying signal). In this example, the performance curves of the most popular
online algorithms, mentioned before, are studied in the context of a time-varying environment. A typical
simulation setup, which is commonly adopted by the adaptive filtering community in order to study
the tracking ability of an algorithm, is that of an unknown vector that undergoes an abrupt change
after a number of observations. Here, we consider a signal, $\boldsymbol{s}$, with a sparse wavelet representation, that
is, $\boldsymbol{s} = \Psi\boldsymbol{\theta}$, where $\Psi$ is the corresponding transformation matrix. In particular, we set $l = 1024$ with
100 nonzero wavelet coefficients. After 1500 observations, 10 arbitrarily picked wavelet coefficients
change their values to new ones, selected uniformly at random from the interval $[-1, 1]$. Note that this
may affect the sparsity level of the signal, and we can now end up with up to 110 nonzero coefficients.
A total of $N = 3000$ sensing vectors are used, which result from the wavelet transform of the input
vectors $\boldsymbol{x}_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, 3000$, having elements drawn from $\mathcal{N}(0, 1)$. In this way, the online
algorithms do not estimate the signal itself, but its sparse wavelet representation, $\boldsymbol{\theta}$. The observations are
corrupted by additive white Gaussian noise of variance $\sigma_n^2 = 0.1$. Regarding SpAPSM, the
extrapolation parameter $\mu_n$ is set equal to $1.8 \times \mathcal{M}_n$, the weights $\omega_i^{(n)}$ all take the same value $1/q$, the hyperslabs
parameter $\epsilon$ was set equal to $1.3\sigma_n$, and $q = 390$. The parameters for all algorithms were selected in
order to optimize their performance. Because the sparsity level of the signal may change (from $k = 100$
up to $k = 110$) and because in practice it is not possible to know in advance the exact value of $k$, we
feed the algorithms with an overestimate, $\hat{k}$, of the true sparsity value, and in particular we use $\hat{k} = 150$
(i.e., 50% overestimation up to the 1500th iteration).

The results are shown in Fig. 10.11. Note the enhanced performance obtained via the SpAPSM
algorithm. However, it has to be pointed out that the complexity of the AdCoSaMP is much lower
compared to the other two algorithms, for the choice of $q = 390$ for the SpAPSM. The interesting
observation is that SpAPSM achieves a better performance compared to OCCD-TWL, albeit at
significantly lower complexity. If, on the other hand, complexity is of major concern, use of SpAPSM offers
the flexibility to use generalized thresholding operators, which lead to improved performance for small
values of $q$, at complexity comparable to that of LMS-based sparsity promoting algorithms [79,80].


10.5 LEARNING SPARSE ANALYSIS MODELS

Our whole discussion so far has been spent in the terrain of signals that are either sparse themselves or
that can be sparsely represented in terms of the atoms of a dictionary in a synthesis model, as introduced
in (9.16), that is,

$$\boldsymbol{s} = \sum_{i \in \mathcal{I}} \theta_i \boldsymbol{\psi}_i.$$

As a matter of fact, most of the research activity has been focused on the synthesis model. This may
be partly due to the fact that the synthesis modeling path may provide a more intuitively appealing
structure to describe the generation of the signal in terms of the elements (atoms) of a dictionary.
Recall from Section 9.9 that the sparsity assumption was imposed on $\boldsymbol{\theta}$ in the synthesis model and
the corresponding optimization task was formulated in (9.38) and (9.39) for the exact and noisy cases,
respectively.

However, this is not the only way to approach the task of sparse modeling. Early in the previous chapter,
in Section 9.4, we referred to the analysis model

$$\tilde{\boldsymbol{s}} = \Phi^H \boldsymbol{s}$$

and pointed out that in a number of real-life applications, the resulting transform $\tilde{\boldsymbol{s}}$ is sparse. To be
fair, the most orthodox way to deal with the underlying model sparsity would be to consider $\|\Phi^H \boldsymbol{s}\|_0$.
Thus, if one wants to estimate $\boldsymbol{s}$, a very natural way would be to cast the related optimization task as

$$\min_{\boldsymbol{s}} \ \|\Phi^H \boldsymbol{s}\|_0, \quad \text{s.t.} \quad \boldsymbol{y} = X\boldsymbol{s}, \ \ \text{or} \ \ \|\boldsymbol{y} - X\boldsymbol{s}\|_2^2 \le \epsilon, \qquad (10.28)$$

depending on whether the measurements via a sensing matrix $X$ are exact or noisy. Strictly speaking,
the total variation minimization approach, which was used in Example 10.1, falls under this analysis
model formulation umbrella, because what is minimized is the $\ell_1$ norm of the gradient transform of
the image.

The optimization tasks in either of the two formulations given in (10.28) build around the
assumption that the signal of interest has a sparse analysis representation. The obvious question that is now
raised is whether the optimization tasks in (10.28) and their counterparts in (9.38) or (9.39) are any
different. One of the first efforts to shed light on this problem was in [51]. There, it was pointed out
that the two tasks, though related, are in general different. Moreover, their comparative performance
depends on the specific problem at hand. Let us consider, for example, the case where the involved
dictionary corresponds to an orthonormal transformation matrix (e.g., DFT). In this case, we already
know that the analysis and synthesis matrices are related as

$$\Phi = \Psi = \Psi^{-H},$$

which leads to an equivalence between the two previously stated formulations. Indeed, for such a
transform we have

$$\underbrace{\tilde{\boldsymbol{s}} = \Phi^H \boldsymbol{s}}_{\text{Analysis}} \quad \Leftrightarrow \quad \underbrace{\boldsymbol{s} = \Phi \tilde{\boldsymbol{s}}}_{\text{Synthesis}}.$$

Using the last formula in (10.28), the tasks in (9.38) or (9.39) are readily obtained by replacing $\boldsymbol{\theta}$ by $\tilde{\boldsymbol{s}}$.
However, this reasoning cannot be extended to the case of overcomplete dictionaries; in these cases,
the two optimization tasks may lead to different solutions.

The previous discussion concerning the comparative performance between the synthesis and
analysis-based sparse representations is not only of “philosophical” value. It turns out that often in
practice, the nature of certain overcomplete dictionaries does not permit the use of the synthesis-based
formulation. These are the cases where the columns of the overcomplete dictionary exhibit a high
degree of dependence; that is, the coherence of the matrix, as defined in Section 9.6.1, has large values.
Typical examples of such overcomplete dictionaries are the Gabor frames, the curvelet frames, and
the oversampled DFT. The use of such dictionaries leads to enhanced performance in a number of
applications (e.g., [111,112]). Take as an example the case of our familiar DFT transform. This
transform provides a representation of our signal samples in terms of sampled exponential sinusoids, whose
integral frequencies are multiples of $\frac{2\pi}{l}$, that is,

$$\boldsymbol{s} := [s_0, s_1, \ldots, s_{l-1}]^T = \sum_{i=0}^{l-1} \tilde{s}_i \boldsymbol{\psi}_i, \qquad (10.29)$$

where, now, $\tilde{s}_i$ are the DFT coefficients and $\boldsymbol{\psi}_i$ is the sampled sinusoid with frequency equal to $\frac{2\pi}{l} i$,
that is,

$$\boldsymbol{\psi}_i = \left[1, \ \exp\left(-j\frac{2\pi}{l} i\right), \ \ldots, \ \exp\left(-j\frac{2\pi}{l} i (l-1)\right)\right]^T. \qquad (10.30)$$


However, this is not necessarily the most efficient representation. For example, it is highly unlikely
that a signal comprises only integral frequencies, and only such signals can result in a sparse
representation using the DFT basis. Most probably, in general, there will be frequencies lying in between the
frequency samples of the DFT basis, and these result in nonsparse representations. If such in-between
frequencies are added as extra atoms (as in the oversampled DFT), a much better representation of the
frequency content of the signal can be obtained. However, in
such a dictionary, the atoms are no longer linearly independent, and the coherence of the respective
(dictionary) matrix increases.
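The effect of in-between frequencies on the sparsity of the DFT representation is easy to check numerically. The following sketch is only illustrative: the helper names are hypothetical, a single complex sinusoid is used as the test signal, the standard DFT sign convention is assumed, and the tolerance `tol` used to decide whether a coefficient is "nonzero" is an arbitrary choice.

```python
import cmath
import math


def dft(s):
    """Plain O(l^2) DFT: coefficient k is sum_n s[n] * exp(-j 2 pi k n / l)."""
    l = len(s)
    return [sum(s[n] * cmath.exp(-2j * math.pi * k * n / l) for n in range(l))
            for k in range(l)]


def complex_sinusoid(freq, l):
    """Samples of exp(j 2 pi freq n / l), n = 0, ..., l-1; freq need not be an integer."""
    return [cmath.exp(2j * math.pi * freq * n / l) for n in range(l)]


def count_significant(coeffs, tol=1e-6):
    """Number of coefficients whose magnitude exceeds tol."""
    return sum(1 for c in coeffs if abs(c) > tol)


# A sinusoid on the DFT grid needs a single coefficient, whereas a sinusoid
# halfway between two grid frequencies spreads over all of them.
on_grid = count_significant(dft(complex_sinusoid(3.0, 16)))
off_grid = count_significant(dft(complex_sinusoid(3.5, 16)))
```

For $l = 16$, the on-grid sinusoid yields one significant coefficient, while the off-grid one yields sixteen (with decaying magnitudes), which is the leakage that motivates richer, and hence more coherent, dictionaries.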

Once a dictionary exhibits high coherence, then there is no way of finding a sensing matrix $X$ so
that $X\Psi$ obeys the RIP. Recall that at the heart of sparsity-aware learning lies the concept of stable
embedding, which allows the recovery of a vector/signal after projecting it on a lower-dimensional space,
which is what all the available conditions (e.g., RIP) guarantee. However, no stable embedding is
possible with highly coherent dictionaries. Take as an extreme example the case where the first and second
atoms are identical. Then no sensing matrix $X$ can achieve a signal recovery that distinguishes the
vector $[1, 0, \ldots, 0]^T$ from $[0, 1, 0, \ldots, 0]^T$. Can one then conclude that for highly coherent overcomplete
dictionaries, compressed sensing techniques are not possible? Fortunately, the answer to this is
negative. After all, our goal in compressed sensing has always been the recovery of the signal $\boldsymbol{s} = \Psi\boldsymbol{\theta}$ and
not the identification of the sparse vector $\boldsymbol{\theta}$ in the synthesis model representation. The latter was just a
means to an end. While the unique recovery of $\boldsymbol{\theta}$ cannot be guaranteed for highly coherent dictionaries,
this does not necessarily cause any problems for the recovery of $\boldsymbol{s}$, using a small set of measurement
samples. The escape route will come by considering the analysis model formulation.


10.5.1 COMPRESSED SENSING FOR SPARSE SIGNAL REPRESENTATION 
IN COHERENT DICTIONARIES 

Our goal in this subsection is to establish conditions that guarantee recovery of a signal vector, which
accepts a sparse representation in a redundant and coherent dictionary, using a small number of
signal-related measurements. Let the dictionary at hand be a tight frame $\Psi$ (see the appendix of the chapter, which
can be downloaded from the book’s website). Then our signal vector is written as

$$\boldsymbol{s} = \Psi\boldsymbol{\theta}, \qquad (10.31)$$

where $\boldsymbol{\theta}$ is assumed to be $k$-sparse. Recalling the properties of a tight frame, as they are summarized
in the appendix, the coefficients in the expansion (10.31) can be written as $\langle \boldsymbol{\psi}_i, \boldsymbol{s} \rangle$, and the respective
vector as

$$\boldsymbol{\theta} = \Psi^T \boldsymbol{s},$$

because a tight frame is self-dual. Then the analysis counterpart of the synthesis formulation in (9.39)
can be cast as

$$\min_{\boldsymbol{s}} \ \|\Psi^T \boldsymbol{s}\|_1, \quad \text{s.t.} \quad \|\boldsymbol{y} - X\boldsymbol{s}\|_2^2 \le \epsilon. \qquad (10.32)$$

The goal now is to investigate the accuracy of the recovered solution to this convex optimization task.
It turns out that similar strong theorems are also valid for this problem, as with the case of the synthesis
formulation, which was studied in Chapter 9.

Definition 10.1. Let $\Sigma_k$ be the union of all subspaces spanned by all subsets of $k$ columns of $\Psi$.
A sensing matrix $X$ obeys the restricted isometry property adapted to $\Psi$ ($\Psi$-RIP) with $\delta_k$, if

$$(1 - \delta_k)\|\boldsymbol{s}\|_2^2 \le \|X\boldsymbol{s}\|_2^2 \le (1 + \delta_k)\|\boldsymbol{s}\|_2^2 \ : \quad \Psi\text{-RIP condition}, \qquad (10.33)$$

for all $\boldsymbol{s} \in \Sigma_k$.

The union of subspaces, $\Sigma_k$, is the image under $\Psi$ of all $k$-sparse vectors. This is the difference
with the RIP definition given in Section 9.7.2. All the random matrices discussed in Chapter 9
can be shown to satisfy this form of RIP, with overwhelming probability, provided the number of
observations, $N$, is at least of the order of $k \ln(l/k)$. We are now ready to establish the main theorem
concerning our minimization task.

Theorem 10.2. Let $\Psi$ be an arbitrary tight frame and $X$ a sensing matrix that satisfies the $\Psi$-RIP with
$\delta_{2k} \le 0.08$, for some positive $k$. Then the solution, $\boldsymbol{s}_*$, of the minimization task in (10.32) satisfies the
property

$$\|\boldsymbol{s} - \boldsymbol{s}_*\|_2 \le C_0 k^{-\frac{1}{2}} \left\|\Psi^T \boldsymbol{s} - (\Psi^T \boldsymbol{s})_k\right\|_1 + C_1 \sqrt{\epsilon}, \qquad (10.34)$$

where $C_0$, $C_1$ are constants depending on $\delta_{2k}$ and $(\Psi^T \boldsymbol{s})_k$ denotes the best $k$-sparse approximation of
$\Psi^T \boldsymbol{s}$, which results by setting all but the $k$ largest in magnitude components of $\Psi^T \boldsymbol{s}$ equal to zero.

The bound in (10.34) is the counterpart of that given in (9.36). In other words, the previous theorem
states that if $\Psi^T \boldsymbol{s}$ decays rapidly, then $\boldsymbol{s}$ can be reconstructed from just a few (compared to the signal
length $l$) observations. The theorem was first given in [25], where, for the first time, such a result
was provided for the sparse analysis model formulation in a general context.

10.5.2 COSPARSITY

In the sparse synthesis formulation, one searches for a solution in a union of subspaces that are formed
by all possible combinations of $k$ columns of the dictionary, $\Psi$. Our signal vector lies in one of these
subspaces, namely, the one that is spanned by the columns of $\Psi$ whose indices lie in the support set
(Section 10.2.1). In the sparse analysis approach, things get different. The kick-off point is the sparsity of
the transform $\tilde{\boldsymbol{s}} := \Phi^T \boldsymbol{s}$, where $\Phi$ defines the transformation matrix or analysis operator. Because $\tilde{\boldsymbol{s}}$
is assumed to be sparse, there exists an index set $\mathcal{I}$ such that $\forall i \in \mathcal{I}$, $\tilde{s}_i = 0$. In other words, $\forall i \in \mathcal{I}$,
$\boldsymbol{\phi}_i^T \boldsymbol{s} := \langle \boldsymbol{\phi}_i, \boldsymbol{s} \rangle = 0$, where $\boldsymbol{\phi}_i$ stands for the $i$th column of $\Phi$. Hence, the subspace in which $\boldsymbol{s}$ resides is
the orthogonal complement of the subspace formed by those columns of $\Phi$ that correspond to a zero in
the transform vector $\tilde{\boldsymbol{s}}$. Assume, now, that $\operatorname{card}(\mathcal{I}) = C_o$. The signal, $\boldsymbol{s}$, can be identified by searching
on the orthogonal complements of the subspaces formed by all possible combinations of $C_o$ columns
of $\Phi$, that is,

$$\langle \boldsymbol{\phi}_i, \boldsymbol{s} \rangle = 0, \quad \forall i \in \mathcal{I}.$$

The difference between the synthesis and analysis problems is illustrated in Fig. 10.12. To facilitate the
theoretical treatment of this new setting, the notion of cosparsity was introduced in [92].

Definition 10.2. The cosparsity of a signal $\boldsymbol{s} \in \mathbb{R}^l$ with respect to a $p \times l$ matrix $\Phi^T$ is defined as

$$C_o := p - \|\Phi^T \boldsymbol{s}\|_0. \qquad (10.35)$$

In words, the cosparsity is the number of zeros in the obtained transform vector $\tilde{\boldsymbol{s}} = \Phi^T \boldsymbol{s}$; in
contrast, the sparsity measures the number of the nonzero elements of the respective sparse vector. If one
assumes that $\Phi$ has “full spark” (see footnote 3), that is, $l + 1$, then any $l$ of the columns of $\Phi$, and thus any $l$ rows
of $\Phi^T$, are guaranteed to be independent. This indicates that for such matrices, the maximum value
that cosparsity can take is equal to $C_o = l - 1$. Otherwise, the existence of $l$ zeros will necessarily
correspond to a zero signal vector. Higher cosparsity levels are possible by relaxing the full spark
requirement.
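As a small numerical illustration of the definition, the following sketch counts the zeros of the transform vector. The function names are hypothetical, the tolerance `tol` for deciding that a coefficient is zero is an arbitrary choice, and the first-order difference operator is used only as a convenient toy analysis operator (it has $p = l - 1$ rows and is therefore not an overcomplete, full spark example).

```python
def cosparsity(phi_t, s, tol=1e-12):
    """Number of (numerically) zero entries of Phi^T s.

    phi_t: the p x l analysis operator Phi^T, given as a list of p rows.
    s: the signal, a list of length l.
    """
    coeffs = [sum(a * b for a, b in zip(row, s)) for row in phi_t]
    return sum(1 for c in coeffs if abs(c) <= tol)


def finite_difference_operator(l):
    """(l-1) x l first-order difference operator, rows of the form [..., -1, 1, ...]."""
    return [[-1.0 if j == i else 1.0 if j == i + 1 else 0.0 for j in range(l)]
            for i in range(l - 1)]


# A piecewise constant signal with a single jump: only one difference is nonzero.
signal = [1.0, 1.0, 1.0, 3.0, 3.0]
c_o = cosparsity(finite_difference_operator(len(signal)), signal)
```

Here the transform vector is $[0, 0, 2, 0]^T$, so the cosparsity equals 3, while the signal itself has no zero entries at all; this is the sense in which total variation type models fit the analysis viewpoint.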

Let now the cosparsity of our signal with respect to a matrix $\Phi^T$ be $C_o$. Then, in order to dig
out the signal from the subspace in which it is hidden, one must form all possible combinations of
$C_o$ columns of $\Phi$ and search in their orthogonal complements. In the case that $\Phi$ is of full spark, we
have seen previously that $C_o < l$, and, hence, any set of $C_o$ columns of $\Phi$ is linearly independent. In
other words, the dimension of the span of those columns is $C_o$. As a result, the dimensionality of the
orthogonal complement, in which we search for $\boldsymbol{s}$, is $l - C_o$.

FIGURE 10.12

Searching for a sparse vector $\boldsymbol{s}$. (A) In the synthesis model, the sparse vector lies in subspaces formed by
combinations of $k$ (in this case $k = 2$) columns of the dictionary $\Psi$. (B) In the analysis model, the sparse vector lies
in the orthogonal complement of the subspace formed by $C_o$ (in this case $C_o = 2$) columns of the transformation
matrix $\Phi$.

3 Recall by Definition 9.2 that spark($\Phi$) is defined for an $l \times p$ matrix $\Phi$ with $p \ge l$ and of full rank.

We have by now accumulated enough information to elaborate a bit more on the statement made
before concerning the different nature of the synthesis and analysis tasks. Let us consider a synthesis
task using an $l \times p$ dictionary, and let $k$ be the sparsity level in the corresponding expansion of a signal
in terms of this dictionary. The dimensionality of the subspaces in which the solution is sought is $k$
($k$ is assumed to be less than the spark of the respective matrix). Let us keep the same dimensionality
for the subspaces in which we are going to search for a solution in an analysis task. Hence, in this case
$C_o = l - k$ (assuming a full spark matrix). Also, for the sake of comparison, assume that the analysis
matrix is $p \times l$. Solving the synthesis task, one has to search $\binom{p}{k}$ subspaces, while solving the analysis
task one has to search $\binom{p}{l-k}$ subspaces. These are two different numbers; assuming that $k \ll l$ and
also that $l < p/2$, which are natural assumptions for overcomplete dictionaries, then the latter of the
two numbers is much larger than the former (use your computer to play with some typical values).
In other words, there are many more analysis than synthesis low-dimensional subspaces for which to
search. The large number of low-dimensional subspaces makes the algorithmic recovery of a solution
from the analysis model a tougher task [92]. However, it might reveal a much stronger descriptive
power of the analysis model compared to the synthesis one.
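Following the suggestion to play with some typical values, the two counts can be compared with exact integer arithmetic. The function name and the particular values of $p$, $l$, and $k$ below are illustrative assumptions only; they are chosen so that $k \ll l$ and $l < p/2$.

```python
from math import comb


def subspace_counts(p, l, k):
    """Number of candidate subspaces for the synthesis and the analysis task.

    Synthesis: choose k out of p atoms.
    Analysis:  choose C_o = l - k out of p rows of the analysis operator
               (full spark assumed, same subspace dimensionality k).
    """
    return comb(p, k), comb(p, l - k)


small_synthesis, small_analysis = subspace_counts(p=10, l=4, k=1)
big_synthesis, big_analysis = subspace_counts(p=256, l=64, k=5)
ratio_digits = len(str(big_analysis // big_synthesis))  # order of magnitude of the gap
```

Even for the tiny case $p = 10$, $l = 4$, $k = 1$ the analysis count (120) exceeds the synthesis count (10); for the larger values the gap grows by many orders of magnitude, which `ratio_digits` makes visible without printing the full integers.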

Another interesting aspect that highlights the difference between the two approaches is the
following. Assume that the synthesis and analysis matrices are related as $\Phi = \Psi$, as was the case for tight
frames. Under this assumption, $\Phi^T \boldsymbol{s}$ provides a set of coefficients for the synthesis expansion in terms
of the atoms of $\Phi = \Psi$. Moreover, if $\|\Phi^T \boldsymbol{s}\|_0 = k$, then $\Phi^T \boldsymbol{s}$ is a possible $k$-sparse solution for the
synthesis model. However, there is no guarantee that this is the sparsest one.

It is now time to investigate whether conditions that guarantee uniqueness of the solution for the
sparse analysis formulation can be derived. The answer is in the affirmative, and it has been established
in [92] for the case of exact measurements.

Lemma 10.1. Let $\Phi$ be a transformation matrix of full spark. Then, for almost all $N \times l$ sensing
matrices and for $N > 2(l - C_o)$, the equation

$$\boldsymbol{y} = X\boldsymbol{s}$$

has at most one solution with cosparsity at least $C_o$.

The above lemma guarantees the uniqueness of the solution, if one exists, of the optimization

$$\min_{\boldsymbol{s}} \ \|\Phi^T \boldsymbol{s}\|_0, \quad \text{s.t.} \quad \boldsymbol{y} = X\boldsymbol{s}. \qquad (10.36)$$

However, solving the previous $\ell_0$ minimization task is a difficult one, and we know that its synthesis
counterpart has been shown to be NP-hard, in general. Its relaxed convex relative is the $\ell_1$ minimization

$$\min_{\boldsymbol{s}} \ \|\Phi^T \boldsymbol{s}\|_1, \quad \text{s.t.} \quad \boldsymbol{y} = X\boldsymbol{s}. \qquad (10.37)$$
In [92], conditions are derived that guarantee the equivalence of the $\ell_0$ and $\ell_1$ tasks, in (10.36)
and (10.37), respectively; this is done in a way similar to that for the sparse synthesis modeling. Also,
in [92], a greedy algorithm inspired by the orthogonal matching pursuit, discussed in Section 10.2.1,
has been derived. A thorough study on greedy-like algorithms applicable to the cosparse model can
be found in [62]. In [103] an iterative analysis thresholding is proposed and theoretically investigated.
Other algorithms that solve the $\ell_1$ optimization in the analysis modeling framework can be found in,
for example, [21,49,108]. NESTA can also be used for the analysis formulation. Moreover, a critical
aspect affecting the performance of algorithms obeying the cosparse analysis model is the choice of the
analysis matrix $\Phi$. It turns out that it is not always the best practice to use fixed and predefined matrices.
As a promising alternative, problem-tailored analysis matrices can be learned using the available data
(e.g., [106,119]).


10.6 A CASE STUDY: TIME-FREQUENCY ANALYSIS 

The goal of this section is to demonstrate how all the previously stated theoretical findings can be
exploited in the context of a real application. Sparse modeling has been applied in a very wide range
of areas, so singling out one typical application is not easy. We preferred to focus on a less “publicized”
application, i.e., that of analyzing echolocation signals emitted by bats. However, the analysis will take
place within the framework of time-frequency representation, which is one of the research areas that
significantly inspired the evolution of compressed sensing theory. Time-frequency analysis of signals
has been a field of intense research for a number of decades, and it is one of the most powerful signal
processing tools. Typical applications include speech processing, sonar sounding, communications,
biological signals, and EEG processing, to name but a few (see, e.g., [13,20,57]).

GABOR TRANSFORM AND FRAMES

It is not our intention to present the theory behind the Gabor transform. Our goal is to outline some 
basic related notions and use them as a vehicle for the less familiar reader to better understand how 
redundant dictionaries are used and to get better acquainted with their potential performance benefits. 

The Gabor transform was introduced in the mid-1940s by Dennis Gabor (1900-1979), a Hungarian-British
engineer. His most notable scientific achievement was the invention of holography, for which
he won the Nobel Prize for Physics in 1971.

The discrete version of the Gabor transform can be seen as a special case of the short-time Fourier
transform (STFT) (e.g., [57,87]). In the standard DFT, the full length of a time sequence,
comprising $l$ samples, is used all in “one go” in order to compute the corresponding frequency content.
However, the latter can be time-varying, so the DFT provides only time-averaged information, which is
of limited use. The Gabor transform (and the STFT in general) introduces time localization via the
use of a window function, which slides along the signal segment in time, and at each time instant
focuses on a different part of the signal. This allows one to follow the slow time variations
which take place in the frequency domain. The time localization in the context of the Gabor transform
is achieved via a Gaussian window function, that is,

$$g(n) := \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{n^2}{2\sigma^2}\right). \tag{10.38}$$

Fig. 10.13A shows the Gaussian window, $g(n - m)$, centered at time instant $m$. The choice of the
window spreading factor, $\sigma$, will be discussed later on.

Let us now construct the atoms of the Gabor dictionary. Recall that in the case of the signal
representation in terms of the DFT in (10.29), each frequency is represented only once, by the corresponding
sampled sinusoid, (10.30). In the Gabor transform, each frequency appears $l$ times; the corresponding
sampled sinusoid is multiplied by the Gaussian window sequence, each time shifted by one sample.
Thus, at the $i$th frequency bin, we have $l$ atoms, $g^{(m,i)}$, $m = 0, 1, \ldots, l - 1$, with elements given by

$$g^{(m,i)}(n) = g(n - m)\psi_i(n), \quad n, m, i = 0, 1, \ldots, l - 1, \tag{10.39}$$

where $\psi_i(n)$ is the $n$th element of the vector $\psi_i$ in (10.30). This results in an overcomplete
dictionary comprising $l^2$ atoms in the $l$-dimensional space. Fig. 10.13B illustrates the effect of multiplying
different sinusoids with Gaussian pulses of different spread and at different time delays. Fig. 10.14
is a graphical interpretation of the atoms involved in the Gabor dictionary. Each node $(m, i)$ in this
time-frequency plot corresponds to an atom of frequency equal to $\frac{2\pi}{l} i$ and delay equal to $m$.

Note that the windowing of a signal of finite duration inevitably introduces boundary effects,
especially when the delay $m$ gets close to the time segment edges, $0$ and $l - 1$. A solution that facilitates the
theoretical analysis is to use modulo-$l$ arithmetic to wrap around at the edge points (this is equivalent
to extending the signal periodically); see, for example, [113].

Once the atoms have been defined, they can be stacked one next to the other to form the columns
of the $l \times l^2$ Gabor dictionary, $G$. It can be shown that the Gabor dictionary is a tight frame [125].
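
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiments of this section. It assumes that the sinusoid in (10.30) is the normalized DFT vector with elements $\frac{1}{\sqrt{l}}\exp\left(j\frac{2\pi}{l}in\right)$, and it uses modulo-$l$ wraparound for the window; the toy size $l = 8$, the value of $\sigma$, and all function names are hypothetical. Under these assumptions, $GG^H$ comes out as a multiple of the identity, which is the tight frame property.

```python
import cmath
import math


def gaussian_window(l, sigma):
    """Gaussian window sampled on a circle of l points (modulo-l wraparound)."""
    win = []
    for n in range(l):
        d = min(n, l - n)  # circular distance from sample 0
        win.append(math.exp(-d * d / (2.0 * sigma ** 2))
                   / (math.sqrt(2.0 * math.pi) * sigma))
    return win


def gabor_dictionary(l, sigma):
    """Return the l * l atoms g^(m,i)(n) = g(n - m) * psi_i(n) as length-l vectors."""
    g = gaussian_window(l, sigma)
    atoms = []
    for i in range(l):        # frequency bin
        for m in range(l):    # time shift
            atoms.append([g[(n - m) % l]
                          * cmath.exp(2j * math.pi * i * n / l) / math.sqrt(l)
                          for n in range(l)])
    return atoms


def frame_operator(atoms, l):
    """Compute G G^H, where the atoms are the columns of G."""
    return [[sum(a[p] * a[q].conjugate() for a in atoms) for q in range(l)]
            for p in range(l)]


l, sigma = 8, 1.5             # toy values, chosen only to keep the example small
atoms = gabor_dictionary(l, sigma)
S = frame_operator(atoms, l)
frame_bound = S[0][0].real
off_diag = max(abs(S[p][q]) for p in range(l) for q in range(l) if p != q)
print(len(atoms), frame_bound, off_diag)
```

In this sketch the common diagonal value equals the squared $\ell_2$ norm of the wrapped window, and the off-diagonal entries vanish up to rounding errors.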

FIGURE 10.13

(A) The Gaussian window with spreading factor $\sigma$ centered at time instant $m$. (B) Pulses obtained by windowing
three different sinusoids with Gaussian windows of different spread and applied at different time instants.

FIGURE 10.14

Each atom of the Gabor dictionary corresponds to a node in the time-frequency grid. That is, it is a sampled
windowed sinusoid whose frequency and location in time are given by the coordinates of the respective node. In
practice, this grid may be subsampled by factors $\alpha$ and $\beta$ for the two axes, respectively, in order to reduce the
number of the involved atoms.

TIME-FREQUENCY RESOLUTION

By definition of the Gabor dictionary, it is readily understood that the choice of the window spread, as
measured by $\sigma$, must be a critical factor, since it controls the localization in time. As known from our
Fourier transform basics, when the pulse becomes short, in order to increase the time resolution, its
corresponding frequency content spreads out, and vice versa. From Heisenberg’s principle, we know
that we can never achieve high time and frequency resolution simultaneously; one is gained at the
expense of the other. It is here where the Gaussian shape in the Gabor transform is justified. It can be
shown that the Gaussian window gives the optimal tradeoff between time and frequency resolution [57,
87]. The time-frequency resolution tradeoff is demonstrated in Fig. 10.15, where three sinusoids are
shown windowed with different pulse durations. The diagram shows the corresponding spread in the
time-frequency plot. The value of $\sigma_t$ indicates the time spread and $\sigma_f$ the spread of the respective
frequency content around the basic frequency of each sinusoid.
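
The tradeoff can also be checked numerically. The following Python sketch is an illustration under stated assumptions: the spreads are measured as root-mean-square widths of the squared magnitudes (one possible definition of $\sigma_t$ and $\sigma_f$, not necessarily the one used in Fig. 10.15), the frequency axis is expressed in DFT bins, and the pulse length $l = 64$ and the function names are hypothetical.

```python
import cmath
import math


def rms_spread(weights, positions):
    """Root-mean-square spread of `positions` under nonnegative `weights`."""
    total = sum(weights)
    mean = sum(w * p for w, p in zip(weights, positions)) / total
    return math.sqrt(sum(w * (p - mean) ** 2
                         for w, p in zip(weights, positions)) / total)


def time_frequency_spreads(l, sigma):
    """Spread in samples and in DFT bins of a Gaussian pulse of width sigma."""
    centre = l // 2
    pulse = [math.exp(-((n - centre) ** 2) / (2.0 * sigma ** 2)) for n in range(l)]
    # Plain O(l^2) DFT; bins are labelled on the symmetric range -l/2 .. l/2 - 1.
    bins = list(range(-l // 2, l // 2))
    spectrum = [sum(pulse[n] * cmath.exp(-2j * math.pi * k * n / l)
                    for n in range(l)) for k in bins]
    sigma_t = rms_spread([p ** 2 for p in pulse], list(range(l)))
    sigma_f = rms_spread([abs(c) ** 2 for c in spectrum], bins)
    return sigma_t, sigma_f


for sigma in (2.0, 4.0, 8.0):
    s_t, s_f = time_frequency_spreads(64, sigma)
    print(sigma, round(s_t, 3), round(s_f, 3), round(s_t * s_f, 3))
```

As the window widens, the time spread grows and the frequency spread shrinks, while their product stays roughly constant for the Gaussian pulse.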

FIGURE 10.15

The shorter the width of the pulsed (windowed) sinusoid is in time, the wider the spread of its frequency content
around the frequency of the sinusoid. The Gaussian-like curves along the frequency axis indicate the energy spread
in frequency of the respective pulses. The values of $\sigma_t$ and $\sigma_f$ indicate the spread in time and frequency,
respectively.

GABOR FRAMES

In practice, $l^2$ can take large values, and it is desirable to see whether one can reduce the number of the
involved atoms without sacrificing the frame-related properties. This can be achieved by an appropriate
subsampling, as illustrated in Fig. 10.14. We only keep the atoms that correspond to the red nodes. That
is, we subsample by keeping every $\alpha$th node in time and every $\beta$th node in frequency in order to form
the dictionary, that is,

$$G_{(\alpha,\beta)} = \left\{g^{(m\alpha, i\beta)}\right\}, \quad m = 0, 1, \ldots, \frac{l}{\alpha} - 1, \quad i = 0, 1, \ldots, \frac{l}{\beta} - 1,$$

where $\alpha$ and $\beta$ are divisors of $l$. Then it can be shown (e.g., [57]) that if $\alpha\beta < l$ the resulting dictionary
retains its frame properties. Once $G_{(\alpha,\beta)}$ is obtained, the canonical dual frame is readily available via
(10.47) (adjusted for complex data), from which the corresponding set of expansion coefficients, $\theta$,
results.
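
As a small worked example of the subsampling, the Python sketch below counts the atoms of $G_{(\alpha,\beta)}$ and the resulting redundancy, that is, the number of atoms divided by $l$. The function names are hypothetical; the value $l = 16{,}384$ and the two $(\alpha, \beta)$ pairs are those used in the bat echolocation experiment of this section.

```python
def gabor_frame_size(l, alpha, beta):
    """Number of atoms kept when subsampling the l x l time-frequency grid."""
    if l % alpha or l % beta:
        raise ValueError("alpha and beta must be divisors of l")
    return (l // alpha) * (l // beta)


def redundancy(l, alpha, beta):
    """Ratio of the number of atoms to the signal dimension l."""
    return gabor_frame_size(l, alpha, beta) / l


l = 16384
for alpha, beta in ((128, 64), (64, 32)):
    print(alpha, beta, gabor_frame_size(l, alpha, beta), redundancy(l, alpha, beta))
# G_(128,64) keeps 2l atoms and G_(64,32) keeps 8l atoms.
```

Since the redundancy equals $l/(\alpha\beta)$, the condition $\alpha\beta < l$ amounts to requiring more atoms than the dimension of the space.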

TIME-FREQUENCY ANALYSIS OF ECHOLOCATION SIGNALS EMITTED BY BATS

Bats use echolocation for navigation (flying around at night), for prey detection (small insects), and 
for prey approaching and catching; each bat adaptively changes the shape and frequency content of 
its calls in order to better serve the previous tasks. Echolocation is used in a similar way for sonars. 
Bats emit calls as they fly, and “listen” to the returning echoes in order to build up a sonic map of 
their surroundings. In this way, bats can infer the distance and the size of obstacles as well as of other 
flying creatures/insects. Moreover, all bats emit special types of calls, called social calls, which are
used for socializing, flirting, and so on. The fundamental characteristics of the echolocation calls, for 
example, the frequency range and average time duration, differ from species to species because, thanks 
to evolution, bats have adapted their calls in order to become better suited to the environment in which 
a species operates. 

Time-frequency analysis of echolocation calls provides information about the species (species iden- 
tification) as well as the specific task and behavior of the bats in certain environments. Moreover, the 
bat biosonar system is studied in order for humans to learn more about nature and get inspired for
subsequent advances in applications such as sonar navigation systems, radars, and medical ultrasonic 
devices. 

Fig. 10.16 shows a case of a recorded echolocation signal from bats. Zooming at two different 
parts of the signal, we can observe that the frequency is changing with time. In Fig. 10.17, the DFT 
of the signal is shown, but there is not much information that can be drawn from it except that the 
signal is compressible in the frequency domain; most of the activity takes place within a short range of 
frequencies. 

Our echolocation signal was a recording of total length $T = 21.845$ ms [75]. Samples were taken
at the sampling frequency $f_s = 750$ kHz, which results in a total of $l = 16{,}384$ samples. Although the
signal itself is not sparse in the time domain, we will take advantage of the fact that it is sparse in a 
transformed domain. We will assume that the signal is sparse in its expansion in terms of the Gabor 
dictionary. 

FIGURE 10.16

The recorded echolocation signal (horizontal axes: time in ms). The frequency of the signal is time-varying,
which is indicated by focusing on two different parts of the signal.

FIGURE 10.17

Plot of the energy of the DFT transform coefficients, $S_i$ (horizontal axis: frequency in kHz). Observe that most
of the frequency activity takes place within a short frequency range.

Our goal in this example is to demonstrate that one does not really need all 16,384 samples to
perform time-frequency analysis; all the processing can be carried out using a reduced number of
observations, by exploiting the theory of compressed sensing. To form the observations vector, $y$, the
number of observations was chosen to be $N = 2048$. This amounts to a reduction of eight times with
respect to the number of available samples. The observations vector was formed as

$$y = Xs,$$

where $X$ is an $N \times l$ sensing matrix comprising entries $\pm 1$ generated in a random way. This means that once
we obtain $y$, we do not need to store the original samples anymore, leading to savings in memory
requirements. Ideally, one could have obtained the reduced number of observations by sampling directly
the analog signal at sub-Nyquist rates, as has already been discussed at the end of Section 9.9. Another
goal is to use both the analysis and synthesis models and demonstrate their difference.
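
The observation step can be sketched as follows. This is a toy Python illustration only: the sizes are scaled down (keeping the same ratio $l/N = 8$), the signal is a random placeholder rather than the bat recording, and the function names and the fixed seed are assumptions made for the example.

```python
import random


def random_sign_matrix(n_rows, n_cols, rng):
    """Sensing matrix with entries drawn independently from {-1, +1}."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(n_cols)]
            for _ in range(n_rows)]


def observe(X, s):
    """Compressed observations y = X s."""
    return [sum(x * v for x, v in zip(row, s)) for row in X]


rng = random.Random(0)        # fixed seed, for repeatability only
l, N = 256, 32                # toy sizes with the same ratio l / N = 8 as in the text
s = [rng.gauss(0.0, 1.0) for _ in range(l)]   # placeholder signal, not the recording
X = random_sign_matrix(N, l, rng)
y = observe(X, s)
print(len(s), len(y), len(s) // len(y))
```

Only the $N$ values in `y` (together with the means to regenerate `X`, e.g., the seed) need to be kept for the subsequent recovery.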

Three different spectrograms were computed. Two of them, shown in Fig. 10.18B and C, correspond
to the reconstructed signals obtained by the analysis (10.37) and the synthesis (9.37) formulations,
respectively. In both cases, the NESTA algorithm was used and the $G_{(128,64)}$ frame was employed.
Note that the latter dictionary is redundant by a factor of 2. The spectrograms are the result of plotting
the time-frequency grid and coloring each node $(t, i)$ according to the energy $|\theta|^2$ of the coefficient
associated with the respective atom in the Gabor dictionary. The full Gabor transform was applied to the
reconstructed signals to obtain the spectrograms, in order to get better coverage of the time-frequency
grid. The scale is logarithmic and the darker areas correspond to larger values. The spectrogram of the
original signal obtained via the full Gabor transform is shown in Fig. 10.18D. It is evident that the
analysis model resulted in a clearer spectrogram, which resembles the original one better. When
the frame $G_{(64,32)}$ is employed, which is a highly redundant Gabor dictionary comprising $8l$ atoms,
then the analysis model results in a recovered signal whose spectrogram is visually indistinguishable
from the original one in Fig. 10.18D.

FIGURE 10.18

(A) Plot of the magnitude of the coefficients, sorted in decreasing order, in the expansion in terms of the $G_{(128,64)}$
Gabor frame. The results correspond to the analysis and synthesis model formulations. The third curve corresponds
to the case of analyzing the original vector signal directly, by projecting it on the dual frame. (B) The spectrogram
from the analysis and (C) the spectrogram from the synthesis formulations. (D) The spectrogram corresponding to
the $G_{(64,32)}$ frame using the analysis formulation. For all cases, the number of observations used was one-eighth of
the total number of signal samples. A, B, and C indicate different parts of the signal, as explained in the text.

Fig. 10.18A is the plot of the magnitude of the corresponding Gabor transform coefficients, sorted
in decreasing values. The synthesis model provides a sparser representation in the sense that the
coefficients decrease much faster. The third curve is the one that results if we multiply the dual frame
matrix of $G_{(128,64)}$ directly with the vector of the original signal samples, and it is shown for comparison
reasons.

To conclude, the curious reader may wonder what these curves in Fig. 10.18D mean after all. The
call denoted by (A) belongs to a Pipistrellus pipistrellus (!) and the call denoted by (B) is either a social
call or belongs to a different species. The signal (C) is the return echo from the signal (A). The large
spread in time of (C) indicates a highly reflective environment [75].


PROBLEMS 


10.1 Show that the step in a greedy algorithm that selects the column of the sensing matrix in order
to maximize the correlation between the column and the currently available error vector $e^{(i-1)}$
is equivalent to selecting the column that reduces the $\ell_2$ norm of the error vector.

Hint: All the parameters obtained in previous steps are fixed, and the optimization is with 
respect to the new column as well as the corresponding weighting coefficient in the estimate of 
the parameter vector. 

10.2 Prove the proposition that if there is a sparse solution to the linear system $y = X\theta$ such that



where $\mu(X)$ is the mutual coherence of $X$, then the column selection procedure in a greedy
algorithm will always select a column among the active columns of $X$, which correspond to
the support of $\theta$; that is, the columns that take part in the representation of $y$ in terms of the
columns of $X$.


Hint: Assume that 



1=1 


10.3 Give an explanation to justify why in step 4 of the CoSaMP algorithm the value of $t$ is taken to
be equal to $2k$.

10.4 Show that if 



where 


$$d(\theta, \tilde{\theta}) := c\|\theta - \tilde{\theta}\|_2^2 - \|X\theta - X\tilde{\theta}\|_2^2,$$


then minimization results in 

10.5 Prove the basic recursion of the parallel coordinate descent algorithm.

Hint: Assume that at the $i$th iteration, it is the turn of the $j$th component to be updated, so that
the following is minimized:

$$J(\theta_j) = \frac{1}{2}\left\|y - X\theta^{(i-1)} + \theta_j^{(i-1)} x_j - \theta_j x_j\right\|_2^2 + \lambda|\theta_j|.$$

10.6 Derive the iterative scheme to minimize the weighted $\ell_1$ ball using a majorization-minimization
procedure to minimize $\sum_{i=1}^{l} \ln(|\theta_i| + \epsilon)$, subject to the observations set $y = X\theta$.

Hint: Use the linearization of the logarithmic function to bound it from above, because it is a
concave function and its graph is located below its tangent.

10.7 Show that the weighted $\ell_1$ ball used in SpAPSM is upper bounded by the $\ell_0$ norm of the target
vector.

10.8 Show that the canonical dual frame minimizes the total norm of the dual frame, that is,

$$\sum_{i \in \mathcal{I}} \|\tilde{\psi}_i\|_2^2.$$

Note that this and the two subsequent problems are related to the appendix of this chapter.
Hint: Use the result of Problem 9.10.

10.9 Show that Parseval’s tight frames are self-dual.

10.10 Prove that the bounds $A$, $B$ of a frame coincide with the maximum and minimum eigenvalues
of the matrix product $\Psi\Psi^T$.

MATLAB® EXERCISES 

10.11 Construct a multitone signal having samples

$$\theta_n = \sum_{j=1}^{3} a_j \cos\left(\frac{\pi}{2N}(2m_j - 1)n\right), \quad n = 0, \ldots, l - 1,$$

where $N = 30$, $l = 2^8$, $a = [0.3, 1, 0.75]^T$, and $m = [4, 10, 30]^T$. (a) Plot this signal in the
time and in the frequency domain (use the “fft.m” MATLAB® function to compute the Fourier
transform). (b) Build a sensing matrix $30 \times 2^8$ with entries drawn from a normal distribution,
$\mathcal{N}(0, \frac{1}{N})$, and recover $\theta$ based on these observations by $\ell_1$ minimization using, for example,
“solvelasso.m” (see MATLAB® exercise 9.21). (c) Build a sensing matrix $30 \times 2^8$, where each
of its rows contains only a single nonzero component taking the value 1. Moreover, each column
has at most one nonzero component. Observe that the multiplication of this sensing matrix with
$\theta$ just picks certain components of $\theta$ (those that correspond to the position of the nonzero value
of each row of the sampling matrix). Show by solving the corresponding $\ell_1$ minimization task,
as in question (b), that $\theta$ can be recovered exactly using such a sparse sensing matrix (containing
only 30 nonzero components!). Observe that the unknown $\theta$ is sparse in the frequency domain
and give an explanation why the recovery is successful with the specific sparse sensing matrix.

10.12 Implement the OMP algorithm (see Section 10.2.1) as well as the CSMP (see Section 10.2.1)
with $t = 2k$. Assume a compressed sensing system using a normally distributed sensing matrix.
(a) Compare the two algorithms in the case where $\alpha = \frac{N}{l} = 0.2$ for $\beta = \frac{k}{N}$ taking values in the
set $\{0.1, 0.2, 0.3, \ldots, 1\}$ (choose yourself a signal and a sensing matrix in order to comply with
the recommendations above). (b) Repeat the same test when $\alpha = 0.8$. Observe that this
experiment, if it is performed for many different $\alpha$ values, $0 < \alpha < 1$, can be used for the estimation
of phase transition diagrams such as the one depicted in Fig. 10.4. (c) Repeat (a) and (b) with
the obtained measurements now contaminated with noise corresponding to 20 dB SNR.

10.13 Reproduce the MRI reconstruction experiment of Fig. 10.9 by running the MATLAB® script
“MRIcs.m,” which is available from the website of the book.

10.14 Reproduce the bat echolocation time-frequency analysis experiment of Fig. 10.18 by running 
the MATLAB® script “BATcs.m,” which is available from the website of the book. 


11.1 INTRODUCTION

Our emphasis in this chapter will be on learning nonlinear models. The necessity of adopting nonlinear 
models has already been discussed in Chapter 3, in the context of the classification as well as the 
regression tasks. For example, recall that given two jointly distributed random vectors $(\mathbf{y}, \mathbf{x}) \in \mathbb{R}^k \times \mathbb{R}^l$,
we know that the optimal estimate of $\mathbf{y}$ given $\mathbf{x} = x$ in the mean-square error (MSE) sense is the
corresponding conditional mean, that is, $E[\mathbf{y}|x]$, which in general is a nonlinear function of $x$.

There are different ways of dealing with nonlinear modeling tasks. Our emphasis in this chapter will 
be on a path through the so-called reproducing kernel Hilbert spaces (RKHS). The technique consists 
of mapping the input variables to a new space, such that the originally nonlinear task is transformed into 
a linear one. From a practical point of view, the beauty behind these spaces is that their rich structure 
allows us to perform inner product operations in a very efficient way, with complexity independent of 
the dimensionality of the respective RKHS. Moreover, note that the dimension of such spaces can even 
be infinite. 

We start the chapter by reviewing some more “traditional” techniques concerning Volterra series 
expansions, and then we move slowly to exploring the RKHS concept. Cover’s theorem, the basic 
properties of RKHSs, and their defining kernels are discussed. Kernel ridge regression and the support
vector machine (SVM) framework are presented. Approximation techniques concerning the kernel
function and the kernel matrix, such as random Fourier features (RFFs) and the Nyström method, and their
implications for online and distributed learning in RKHS are discussed. Finally, some more advanced
concepts related to sparsity and multikernel representations are presented. A case study in the context
of text mining is presented at the end of the chapter.


11.2 GENERALIZED LINEAR MODELS 

Given $(y, x) \in \mathbb{R} \times \mathbb{R}^l$, a generalized linear estimator $\hat{y}$ of $y$ has the form

$$\hat{y} = f(x) := \theta_0 + \sum_{k=1}^{K} \theta_k \phi_k(x), \tag{11.1}$$

where $\phi_1, \ldots, \phi_K$ are preselected (nonlinear) functions. A popular family of functions is the
polynomial one, for example,

$$\hat{y} = \theta_0 + \sum_{i=1}^{l} \theta_i x_i + \sum_{i=1}^{l-1} \sum_{m=i+1}^{l} \theta_{im} x_i x_m + \sum_{i=1}^{l} \theta_{ii} x_i^2. \tag{11.2}$$

Assuming $l = 2$ ($x = [x_1, x_2]^T$), (11.2) can be brought into the form of (11.1) by setting $K = 5$ and
$\phi_1(x) = x_1$, $\phi_2(x) = x_2$, $\phi_3(x) = x_1 x_2$, $\phi_4(x) = x_1^2$, $\phi_5(x) = x_2^2$. The generalization of (11.2) to
$r$th-order polynomials is readily obtained and it will contain products of the form $x_1^{p_1} x_2^{p_2} \cdots x_l^{p_l}$, with
$p_1 + p_2 + \cdots + p_l \le r$. It turns out that the number of free parameters, $K$, for an $r$th-order polynomial
is equal to

$$K = \frac{(l + r)!}{r!\, l!}.$$

Just to get a feeling, for $l = 10$ and $r = 3$, $K = 286$. The use of polynomial expansions is justified
by the Weierstrass theorem, stating that every continuous function defined on a compact (closed and
bounded) subset $S \subset \mathbb{R}^l$ can be uniformly approximated as closely as desired, with an arbitrarily small
error, $\epsilon$, by a polynomial function (for example, [95]). Of course, in order to achieve a good enough
approximation, one may have to use a large value of $r$. Besides polynomial functions, other types of
functions can also be used, such as splines and trigonometric functions.
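
The growth of $K$ with $l$ and $r$ is easy to check numerically. The Python sketch below evaluates $(l + r)!/(r!\,l!)$ and, as a cross-check, enumerates the exponent tuples $(p_1, \ldots, p_l)$ with $p_1 + \cdots + p_l \le r$. The function names are hypothetical. Note that the enumeration includes the all-zero tuple, that is, the constant term $\theta_0$; this is why $l = 2$, $r = 2$ gives 6, one more than the five functions $\phi_1, \ldots, \phi_5$ listed above.

```python
from itertools import combinations_with_replacement
from math import comb


def num_polynomial_terms(l, r):
    """(l + r)! / (r! l!): monomials of total degree at most r in l variables."""
    return comb(l + r, r)


def monomial_exponents(l, r):
    """Enumerate exponent tuples (p_1, ..., p_l) with p_1 + ... + p_l <= r."""
    terms = []
    for degree in range(r + 1):
        for picks in combinations_with_replacement(range(l), degree):
            terms.append(tuple(picks.count(i) for i in range(l)))
    return terms


print(num_polynomial_terms(10, 3))        # 286
print(monomial_exponents(2, 2))           # six tuples, including (0, 0)
print(len(monomial_exponents(10, 3)))     # 286, agreeing with the formula
```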

A common characteristic of these types of models is that the basis functions in the expansion are
preselected and they are fixed and independent of the data. The advantage of such a path is that the
associated models are linear with respect to the unknown set of free parameters, and they can be
estimated by following any one of the methods described for linear models, presented in Chapters 4-8.

However, one has to pay a price for that. As shown in [7], for an expansion involving $K$ fixed
functions, the squared approximation error cannot be made smaller than order $\left(\frac{1}{K}\right)^{2/l}$. In other words, for
high-dimensional spaces and in order to get a small enough error, one has to use large values of $K$.
This is another face of the curse of dimensionality problem. In contrast, one can get rid of the
dependence of the approximation error on the input space dimensionality, $l$, if the expansion involves
data-dependent functions, which are optimized with respect to the specific data set. This is, for example, the
case for a class of neural networks, to be discussed in Chapter 18. In this case, the price one pays is
that the dependence on the free parameters is now nonlinear, making the optimization with regard to
the unknown parameters a harder task.
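
To get a rough feeling for what the bound implies, the following Python sketch computes the smallest $K$ for which $(1/K)^{2/l}$ drops below a target value $\epsilon$. It is an order-of-magnitude illustration only: the constants hidden in the "order of" statement are ignored, and the function name and the target value $\epsilon = 0.1$ are assumptions made for the example.

```python
import math


def min_terms_for_error(eps, l):
    """Smallest K with (1/K)**(2/l) <= eps, ignoring constant factors in the bound."""
    return math.ceil(eps ** (-l / 2.0))


for l in (2, 10, 20):
    print(l, min_terms_for_error(0.1, l))
```

The required number of fixed functions grows exponentially with the input dimension $l$, which is the curse of dimensionality referred to above.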


11.3 VOLTERRA, WIENER, AND HAMMERSTEIN MODELS 

Let us start with the case of modeling nonlinear systems, where the involved input-output entities
are time series/discrete-time signals denoted as $(u_n, d_n)$, respectively. The counterpart of polynomial
modeling in (11.2) is now known as the Volterra series expansion.

These types of models will not be pursued any more in this book, and they are briefly discussed here
in order to put the nonlinear modeling task in a more general context as well as for historical reasons.
Thus, this section can be bypassed in a first reading.


FIGURE 11.1

The nonlinear filter is excited by $u_n$ and provides in its output $d_n$.

Volterra was an Italian mathematician (1860-1940) who made major contributions to mathematics 
as well as physics and biology. One of his landmark theories is the development of Volterra series, 
which is used to solve integral and integro-differential equations. He was one of the Italian professors 
who refused to take an oath of loyalty to the fascist regime of Mussolini and he was obliged to resign 
from his university post. 

Fig. 11.1 shows an unknown nonlinear system/filter with the respective input-output signals. The 
output of a discrete-time Volterra model can be written as 

$$d_n = \sum_{k=1}^{r} \sum_{i_1=0}^{M} \sum_{i_2=0}^{M} \cdots \sum_{i_k=0}^{M} w_k(i_1, i_2, \ldots, i_k) \prod_{j=1}^{k} u_{n-i_j}, \tag{11.3}$$

where $w_k(\cdot, \ldots, \cdot)$ denotes the $k$th-order Volterra kernel; in general, $r$ can be infinite. For example, for
$r = 2$ and $M = 1$, the input-output relation involves the linear combination of the terms

$$u_n, \quad u_{n-1}, \quad u_n^2, \quad u_{n-1}^2, \quad u_n u_{n-1}.$$
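
A direct, unoptimized evaluation of the discrete-time Volterra model can be sketched as follows. The representation of the kernels as dictionaries, the zero initial conditions, the kernel values, and the function name are all assumptions made for this illustration. For the product $u_n u_{n-1}$, only one of the two index orderings, $(0, 1)$, is stored.

```python
from itertools import product


def volterra_output(u, n, kernels, M):
    """Evaluate d_n = sum_k sum_{i_1..i_k} w_k(i_1,...,i_k) * prod_j u[n - i_j].

    kernels[k] maps an index tuple (i_1, ..., i_k) to w_k(i_1, ..., i_k);
    missing tuples are treated as zero. Samples outside the record are zero.
    """
    def sample(t):
        return u[t] if 0 <= t < len(u) else 0.0

    d = 0.0
    for k, w_k in kernels.items():
        for idx in product(range(M + 1), repeat=k):
            weight = w_k.get(idx, 0.0)
            if weight == 0.0:
                continue
            term = weight
            for i in idx:
                term *= sample(n - i)
            d += term
    return d


# Hypothetical kernels for r = 2, M = 1 (values chosen only for illustration).
kernels = {
    1: {(0,): 1.0, (1,): 0.5},                     # u_n, u_{n-1}
    2: {(0, 0): 0.2, (1, 1): -0.1, (0, 1): 0.3},   # u_n^2, u_{n-1}^2, u_n u_{n-1}
}
u = [1.0, 2.0, -1.0]
print(volterra_output(u, 2, kernels, M=1))
```

The output is linear in the kernel values, but the nested loops make explicit how quickly the number of terms grows with $r$ and $M$.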


Special cases of the Volterra expansion are the Wiener, Hammerstein, and Wiener-Hammerstein
models. These models are shown in Fig. 11.2. The systems $h(\cdot)$ and $g(\cdot)$ are linear systems with memory,
that is,

$$s_n = \sum_{i=0}^{M_1} h_i u_{n-i}$$

and

$$d_n = \sum_{i=0}^{M_2} g_i x_{n-i}.$$

The central box corresponds to a memoryless nonlinear system, which can be approximated by a
polynomial of degree $r$. Hence,

$$x_n = \sum_{k=1}^{r} c_k (s_n)^k.$$

In other words, a Wiener model is a linear time-invariant (LTI) system followed by the memoryless
nonlinearity and the Hammerstein model is the combination of a memoryless nonlinearity followed
by an LTI system. The Wiener-Hammerstein model is the combination of the two. Note that each one
of these models is nonlinear with respect to the involved free parameters. In contrast, the equivalent
Volterra model is linear with regard to the involved parameters; however, the number of the resulting
free parameters is significantly increased with the order of the polynomial and the filter memory taps
($M_1$ and $M_2$). The interesting feature is that the Volterra expansion equivalent to a Hammerstein model
consists only of the diagonal elements of the associated Volterra kernels. In other words, the output is
expressed in terms of $u_n, u_{n-1}, u_{n-2}, \ldots$ and their powers; there are no cross-product terms [60].

FIGURE 11.2

The Wiener model comprises a linear filter followed by a memoryless polynomial nonlinearity. The Hammerstein
model consists of a memoryless nonlinearity followed by a linear filter. The Wiener-Hammerstein model is the
combination of the two.
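
The Wiener and Hammerstein structures can be sketched in a few lines of Python. This is a minimal illustration with zero initial conditions; the filter taps, the polynomial coefficients, and the function names are hypothetical values chosen for the example.

```python
def fir(coeffs, x, n):
    """Linear filter with memory: sum_i coeffs[i] * x[n - i], zero initial conditions."""
    return sum(c * x[n - i] for i, c in enumerate(coeffs) if 0 <= n - i < len(x))


def polynomial(c, s):
    """Memoryless nonlinearity sum_{k=1}^{r} c[k-1] * s**k."""
    return sum(ck * s ** (k + 1) for k, ck in enumerate(c))


def wiener(u, h, c):
    """Linear filter h followed by the memoryless polynomial."""
    return [polynomial(c, fir(h, u, n)) for n in range(len(u))]


def hammerstein(u, c, g):
    """Memoryless polynomial followed by the linear filter g."""
    x = [polynomial(c, v) for v in u]
    return [fir(g, x, n) for n in range(len(u))]


u = [1.0, 2.0, -1.0]
h = g = [1.0, 0.5]      # hypothetical filter taps
c = [1.0, 0.2]          # hypothetical polynomial coefficients, r = 2
print(wiener(u, h, c))       # squares (u_n + 0.5 u_{n-1}): cross-products appear
print(hammerstein(u, c, g))  # filters powers of single samples: no cross-products
```

In the Wiener case the polynomial acts on a sum of delayed inputs, so products such as $u_n u_{n-1}$ arise; in the Hammerstein case the polynomial acts on each sample separately, which is why only powers of individual samples appear.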

Remarks 11.1. 

• The Volterra series expansion was first introduced as a generalization of the Taylor series expansion.
Following [101], assume a memoryless nonlinear system. Then its input-output relationship is given
by

$$d(t) = f(u(t)),$$

and adopting the Taylor expansion, for a particular time $t \in (-\infty, +\infty)$, we can write

$$d(t) = \sum_{n=0}^{+\infty} c_n (u(t))^n, \tag{11.4}$$

assuming that the series converges. The Volterra series is the extension of (11.4) to systems with
memory, and we can write

$$d(t) = w_0 + \int_{-\infty}^{+\infty} w_1(\tau_1) u(t - \tau_1)\, d\tau_1 + \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} w_2(\tau_1, \tau_2) u(t - \tau_1) u(t - \tau_2)\, d\tau_1\, d\tau_2 + \cdots \tag{11.5}$$


In other words, the Volterra series is a power series with memory. The problem of convergence of 
the Volterra series is similar to that of the Taylor series. In analogy to the Weierstrass approximation 
theorem, it turns out that the output of a nonlinear system can be approximated arbitrarily close 
using a sufficient number of terms in the Volterra series expansion [44] . 

A major difficulty in the Volterra series is the computation of the Volterra kernels. Wiener was the 
first to realize the potential of the Volterra series for nonlinear system modeling. In order to compute 
the involved Volterra kernels, he used the method of orthogonal functionals. The method resembles 
the method of using a set of orthogonal polynomials, when one tries to approximate a function via a 
polynomial expansion [130]. More on the Volterra modeling and related models can be found in,
for example, [57,70,102]. Volterra models have been extensively used in a number of applications,
including communications (e.g., [11]), biomedical engineering (e.g., [72]), and automatic control
(e.g., [32]).


1 The proof involves the theory of continuous functionals. A functional is a mapping of a function to the real axis. Observe that
each integral is a functional, for a particular $t$ and kernel.


11.4 COVER’S THEOREM: CAPACITY OF A SPACE IN LINEAR DICHOTOMIES

We have already justified the method of expanding an unknown nonlinear function in terms of a 
fixed set of nonlinear ones, by mobilizing arguments from the approximation theory. Although this 
framework fits perfectly to the regression task, where the output takes values in an interval in R, such 
arguments are not well suited for the classification. In the latter case, the output value is of a discrete 
nature. For example, in a binary classification task, $y \in \{1, -1\}$, and as long as the sign of the predicted
value, $\hat{y}$, is correct, we do not care how close $y$ and $\hat{y}$ are. In this section, we will present an elegant
and powerful theorem that justifies the expansion of a classifier $f$ in the form of (11.1). It suffices to
look at (11.1) from a different angle.

Let us consider $N$ points, $x_1, x_2, \ldots, x_N \in \mathbb{R}^l$. We can say that these points are in general position
if there is no subset of $l + 1$ of them lying on an $(l - 1)$-dimensional hyperplane. For example, in the
two-dimensional space, any three of these points are not permitted to lie on a straight line.

Theorem 11.1 (Cover’s theorem). The number of groupings, denoted as $\mathcal{O}(N, l)$, that can be formed
by $(l - 1)$-dimensional hyperplanes to separate the $N$ points in two classes, exploiting all possible
combinations, is given by (see [31] and Problem 11.1)

$$\mathcal{O}(N, l) = 2 \sum_{i=0}^{l} \binom{N-1}{i},$$

where

$$\binom{N-1}{i} = \frac{(N-1)!}{(N-1-i)!\, i!}.$$



FIGURE 11.3 

The possible number of linearly separable groupings of two, for four points in the two-dimensional space is
$\mathcal{O}(4, 2) = 14 = 2 \times 7$.

Each one of these groupings in two classes is also known as a (linear) dichotomy. Fig. 11.3 illustrates
the theorem for the case of $N = 4$ points in the two-dimensional space. Observe that the
possible groupings are [(ABCD)], [A, (BCD)], [B, (ACD)], [C, (ABD)], [D, (ABC)], [(AB), (CD)], and
[(AC), (BD)]. Each grouping is counted twice, as it can belong to either the $\omega_1$ or the $\omega_2$ class. Hence, the total
number of groupings is 14, which is equal to $\mathcal{O}(4, 2)$. Note that the number of all possible combinations
of $N$ points in two groups is $2^N$, which is 16 in our case. The grouping that is not counted in $\mathcal{O}(4, 2)$,
as it cannot be linearly separated, is [(BC), (AD)]. Note that if $N \le l + 1$, then $\mathcal{O}(N, l) = 2^N$. That is,
all possible combinations in groups of two are linearly separable; verify it for the case of $N = 3$ in the
two-dimensional space.

FIGURE 11.4

For $N > 2(l + 1)$ the probability of linear separability becomes small. For large values of $l$, and provided
$N < 2(l + 1)$, the probability of any grouping of the data into two classes to be linearly separable tends to unity.
Also, if $N \le (l + 1)$, all possible groupings in two classes are linearly separable.
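The counting formula of Theorem 11.1 is easy to check numerically. The following is a minimal sketch in Python; the function name `cover_count` is an assumption introduced here for illustration and does not come from the text.

```python
from math import comb


def cover_count(n_points: int, dim: int) -> int:
    """O(N, l): number of linear dichotomies of N points in general position in R^l."""
    # math.comb returns 0 when i > n_points - 1, so the sum is also valid for N <= l + 1.
    return 2 * sum(comb(n_points - 1, i) for i in range(dim + 1))


print(cover_count(4, 2))  # 14, the case illustrated in Fig. 11.3
print(cover_count(3, 2))  # 8 = 2**3: every grouping of three points in the plane is separable
```

The second call corresponds to the verification suggested above for $N = 3$ in the two-dimensional space.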

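Based on the previous theorem, given $N$ points in the $l$-dimensional space, the probability of grouping
these points in two linearly separable classes is

$$P_N^l = \frac{\mathcal{O}(N, l)}{2^N} = \begin{cases} \dfrac{1}{2^{N-1}} \displaystyle\sum_{i=0}^{l} \binom{N-1}{i}, & N > l + 1, \\[2mm] 1, & N \le l + 1. \end{cases} \tag{11.6}$$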


To visualize this finding, let us write $N = r(l + 1)$, and express the probability $P_N^l$ in terms of $r$, for a
fixed value of $l$. The resulting graph is shown in Fig. 11.4. Observe that there are two distinct regions,
one to the left of the point $r = 2$ and one to the right. At the point $r = 2$, that is, $N = 2(l + 1)$, the
probability is always $\frac{1}{2}$, because $\mathcal{O}(2l + 2, l) = 2^{2l+1}$ (Problem 11.2). Note that the larger the value of
$l$ is, the sharper the transition from one region to the other becomes. Thus, for high-dimensional spaces
and as long as $N < 2(l + 1)$, the probability of any grouping of the points in two classes to be linearly
separable tends to unity.
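The behavior of the probability as a function of $r$ can be reproduced with exact arithmetic. The sketch below is only illustrative: the function name `separability_probability` and the grid of values for $r$ and $l$ are assumptions made here, and $N$ is rounded down to an integer when $r(l + 1)$ is not one.

```python
from fractions import Fraction
from math import comb


def separability_probability(n_points: int, dim: int) -> Fraction:
    """P_N^l = O(N, l) / 2^N, with O(N, l) given by Cover's theorem."""
    count = 2 * sum(comb(n_points - 1, i) for i in range(dim + 1))
    return Fraction(count, 2 ** n_points)


for dim in (1, 10, 100):
    row = []
    for r in (1, 1.5, 2, 2.5, 3):
        n_points = int(r * (dim + 1))  # N = r (l + 1), rounded down
        row.append(round(float(separability_probability(n_points, dim)), 4))
    print(dim, row)
```

For every $l$ the value at $r = 2$ is exactly one half, while the values on either side move toward 1 and 0 as $l$ grows, which is the sharpening of the transition described above.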

The way the previous theorem is exploited in practice is the following. Given $N$ feature vectors
$\boldsymbol{x}_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, a mapping

$$\boldsymbol{\phi}: \mathbb{R}^l \ni \boldsymbol{x}_n \longmapsto \boldsymbol{\phi}(\boldsymbol{x}_n) \in \mathbb{R}^K, \quad K \gg l,$$

is performed. Then according to the theorem, the higher the value of $K$ is, the higher the probability
becomes for the images of the mapping, $\boldsymbol{\phi}(\boldsymbol{x}_n) \in \mathbb{R}^K$, $n = 1, 2, \ldots, N$, to be linearly separable in
the space $\mathbb{R}^K$. Note that the expansion of a nonlinear classifier (that predicts the label in a binary
classification task) is equivalent to using a linear one on the images of the original points after the
mapping. Indeed,

$$f(\boldsymbol{x}) = \sum_{k=1}^{K} \theta_k \phi_k(\boldsymbol{x}) + \theta_0 = \boldsymbol{\theta}^T \begin{bmatrix} \boldsymbol{\phi}(\boldsymbol{x}) \\ 1 \end{bmatrix}, \tag{11.7}$$

with

$$\boldsymbol{\phi}(\boldsymbol{x}) := [\phi_1(\boldsymbol{x}), \phi_2(\boldsymbol{x}), \ldots, \phi_K(\boldsymbol{x})]^T.$$

Provided that $K$ is large enough, our task is linearly separable in the new space, $\mathbb{R}^K$, with high probability,
which justifies the use of a linear classifier, $\boldsymbol{\theta}$, in (11.7). The procedure is illustrated in Fig. 11.5.
The points in the two-dimensional space are not linearly separable. However, after the mapping in the
three-dimensional space,

$$[x_1, x_2]^T \longmapsto \boldsymbol{\phi}(\boldsymbol{x}) = [x_1, x_2, f(x_1, x_2)]^T, \quad f(x_1, x_2) = 4\exp\left(-(x_1^2 + x_2^2)/3\right) + 5,$$

the points in the two classes become linearly separable. Note, however, that after the mapping, the
points lie on the surface of a paraboloid. This surface is fully described in terms of two free variables.
Loosely speaking, we can think of the two-dimensional plane, on which the data lie originally, to be
folded/transformed to form the surface of the paraboloid. This is basically the idea behind the more
general problem. After the mapping from the original $l$-dimensional space to the new $K$-dimensional
one, the images of the points $\boldsymbol{\phi}(\boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, lie on an $l$-dimensional surface (manifold) in
$\mathbb{R}^K$ [19]. We cannot fool nature. Because $l$ variables were originally chosen to describe each pattern
(dimensionality, number of free parameters), the same number of free parameters will be required to
describe the same objects after the mapping in $\mathbb{R}^K$. In other words, after the mapping, we embed an
$l$-dimensional manifold in a $K$-dimensional space in such a way that the data in the two classes become
linearly separable.

FIGURE 11.5

The points (red in one class and black for the other) that are not linearly separable in the original two-dimensional
plane become linearly separable after the nonlinear mapping in the three-dimensional space; one can draw a plane
that separates the “black” from the “red” points.
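The effect of the three-dimensional mapping can be illustrated with a toy data set. The sketch below is a hedged illustration, not the data of Fig. 11.5: the two ring-shaped classes, the helper names `lift` and `sample_ring`, and the hand-picked threshold of 7 are all assumptions made here.

```python
import math
import random


def lift(x1: float, x2: float) -> tuple:
    """Map [x1, x2] to [x1, x2, f(x1, x2)] with f = 4 exp(-(x1^2 + x2^2) / 3) + 5."""
    return (x1, x2, 4.0 * math.exp(-(x1 ** 2 + x2 ** 2) / 3.0) + 5.0)


def sample_ring(r_min: float, r_max: float, count: int, rng: random.Random) -> list:
    """Draw points whose distance from the origin lies in [r_min, r_max]."""
    points = []
    for _ in range(count):
        radius = rng.uniform(r_min, r_max)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


rng = random.Random(0)
inner = sample_ring(0.0, 1.0, 50, rng)  # class 1: close to the origin
outer = sample_ring(2.0, 3.0, 50, rng)  # class 2: surrounds class 1

# No straight line separates the two classes in the plane, but after the lift
# the horizontal plane z = 7 does (threshold chosen by hand for this toy data).
threshold = 7.0
inner_heights = [lift(*p)[2] for p in inner]
outer_heights = [lift(*p)[2] for p in outer]
print(min(inner_heights) > threshold, max(outer_heights) < threshold)
```

The separating plane depends only on the third coordinate, so the linear classifier in the lifted space corresponds to a circular decision boundary in the original plane.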

We have by now fully justified the need for mapping the task from the original low-dimensional 
space to a higher-dimensional one, via a set of nonlinear functions. However, working in
high-dimensional spaces is not easy. A large number of parameters are needed; this in turn poses computational
complexity problems, and raises issues related to overfitting and the generalization performance of
the designed predictors. In the sequel, we will address the former of the two problems by making a
“careful” mapping to a higher-dimensional space of a specific structure. The latter problem will be 
addressed via regularization, as has already been discussed in various parts in previous chapters. 


11.5 REPRODUCING KERNEL HILBERT SPACES 

Consider a linear space $\mathbb{H}$ of real-valued functions defined on a set $\mathcal{X} \subseteq \mathbb{R}^l$ (see footnote 2). Furthermore, suppose
that $\mathbb{H}$ is a Hilbert space (see footnote 3); that is, it is equipped with an inner product operation, $\langle \cdot, \cdot \rangle_{\mathbb{H}}$, that defines
a corresponding norm $\|\cdot\|_{\mathbb{H}}$ and $\mathbb{H}$ is complete with respect to this norm. From now on, and for
notational simplicity, we omit the subscript $\mathbb{H}$ from the inner product and norm notations, and we are
going to use it only if it is necessary to avoid confusion.

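Definition 11.1. A Hilbert space $\mathbb{H}$ is called reproducing kernel Hilbert space (RKHS) if there exists
a function

$$\kappa: \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}$$

with the following properties:

• for every $\boldsymbol{x} \in \mathcal{X}$, $\kappa(\cdot, \boldsymbol{x})$ belongs to $\mathbb{H}$;

• $\kappa(\cdot, \cdot)$ has the so-called reproducing property, that is,

$$f(\boldsymbol{x}) = \langle f, \kappa(\cdot, \boldsymbol{x}) \rangle, \quad \forall f \in \mathbb{H}, \ \forall \boldsymbol{x} \in \mathcal{X}: \quad \text{reproducing property.} \tag{11.8}$$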


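In words, the kernel function is a function of two arguments. Fixing the value of one of them to, say,
$\boldsymbol{x} \in \mathcal{X}$, the kernel becomes a function of a single argument (associated with the $\cdot$) and this function
belongs to $\mathbb{H}$. The reproducing property means that the value of any function $f \in \mathbb{H}$, at any $\boldsymbol{x} \in \mathcal{X}$, is
equal to the respective inner product, performed in $\mathbb{H}$, between $f$ and $\kappa(\cdot, \boldsymbol{x})$.

A direct consequence of the reproducing property, if we set $f(\cdot) = \kappa(\cdot, \boldsymbol{y})$, $\boldsymbol{y} \in \mathcal{X}$, is that

$$\langle \kappa(\cdot, \boldsymbol{y}), \kappa(\cdot, \boldsymbol{x}) \rangle = \kappa(\boldsymbol{x}, \boldsymbol{y}) = \kappa(\boldsymbol{y}, \boldsymbol{x}). \tag{11.9}$$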


2 Generalization to more general sets is also possible. 

3 For the unfamiliar reader, a Hilbert space is the generalization of Euclidean space allowing for infinite dimensions. More
rigorous definitions and related properties are given in the appendix of Chapter 8.

Definition 11.2. Let $\mathbb{H}$ be an RKHS, associated with a kernel function $\kappa(\cdot, \cdot)$, and $\mathcal{X}$ a set of elements.
Then the mapping

$$\mathcal{X} \ni \boldsymbol{x} \longmapsto \phi(\boldsymbol{x}) := \kappa(\cdot, \boldsymbol{x}) \in \mathbb{H}: \quad \text{feature map}$$

is known as feature map and the space $\mathbb{H}$ as the feature space.

In other words, if $\mathcal{X}$ is the set of our observation vectors, the feature mapping maps each vector
to the RKHS $\mathbb{H}$. Note that, in general, $\mathbb{H}$ can be of infinite dimension and its elements are functions.
That is, each training point is mapped to a function. In special cases, where $\mathbb{H}$ is of a finite dimension,
$K$, the image can be represented as an equivalent vector $\boldsymbol{\phi}(\boldsymbol{x}) \in \mathbb{R}^K$. From now on, the general
infinite-dimensional case will be treated and the images will be denoted as functions, $\phi(\cdot)$.

Let us now see what we have gained by choosing to perform the feature mapping from the original
space to a high-dimensional RKHS one. Let $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{X} \subseteq \mathbb{R}^l$. Then the inner product of the respective
mapping images is written as

$$\langle \phi(\boldsymbol{x}), \phi(\boldsymbol{y}) \rangle = \langle \kappa(\cdot, \boldsymbol{x}), \kappa(\cdot, \boldsymbol{y}) \rangle,$$

or

$$\langle \phi(\boldsymbol{x}), \phi(\boldsymbol{y}) \rangle = \kappa(\boldsymbol{x}, \boldsymbol{y}): \quad \text{kernel trick.}$$

In other words, employing this type of mapping to our problem, we can perform inner product operations
in $\mathbb{H}$ in a very efficient way; that is, via a function evaluation performed in the original
low-dimensional space! This property is also known as the kernel trick, and it facilitates significantly
the computations. As will become apparent soon, the way this property is exploited in practice involves
the following steps:

1. Map (implicitly) the input training data to an RKHS

$$\boldsymbol{x}_n \longmapsto \phi(\boldsymbol{x}_n) \in \mathbb{H}, \quad n = 1, 2, \ldots, N.$$

2. Solve a linear estimation task in $\mathbb{H}$, involving the images $\phi(\boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$.

3. Cast the algorithm that solves for the unknown parameters in terms of inner product operations in
the form

$$\langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle, \quad i, j = 1, 2, \ldots, N.$$

4. Replace each inner product by a kernel evaluation, that is,

$$\langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle = \kappa(\boldsymbol{x}_i, \boldsymbol{x}_j).$$


It is apparent that one does not need to perform any explicit mapping of the data. All that is needed is to
perform the kernel operations at the final step. Note that the specific form of $\kappa(\cdot, \cdot)$ does not concern
the analysis. Once the algorithm for the prediction, $\hat{y}$, has been derived, one can use different choices
for $\kappa(\cdot, \cdot)$. As we will see, different choices for $\kappa(\cdot, \cdot)$ correspond to different types of nonlinearity.
Fig. 11.6 illustrates the rationale behind the procedure. In practice, the four steps listed above are
equivalent to (a) working in the original (low-dimensional Euclidean) space and expressing all operations in
terms of inner products and (b) substituting the inner products with kernel evaluations at the final step.

FIGURE 11.6

The nonlinear task in the original low-dimensional space is mapped to a linear one in the high-dimensional
RKHS $\mathbb{H}$. Using feature mapping, inner product operations are efficiently performed via kernel evaluations in
the original low-dimensional space.


Example 11.1. The goal of this example is to demonstrate that one can map the input space into
another of higher dimension (finite-dimensional for this case; see footnote 4), where the corresponding inner product
can be computed as a function performed in the lower-dimensional original one. Consider the case of
the two-dimensional space and the mapping into a three-dimensional one, i.e.,

$$\mathbb{R}^2 \ni \boldsymbol{x} \longmapsto \boldsymbol{\phi}(\boldsymbol{x}) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]^T \in \mathbb{R}^3.$$

Then, given two vectors $\boldsymbol{x} = [x_1, x_2]^T$ and $\boldsymbol{y} = [y_1, y_2]^T$, it is straightforward to see that

$$\boldsymbol{\phi}^T(\boldsymbol{x})\boldsymbol{\phi}(\boldsymbol{y}) = x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2 = (\boldsymbol{x}^T\boldsymbol{y})^2.$$

That is, the inner product in the three-dimensional space, after the mapping, is given in terms of a
function of the variables in the original space.
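The identity of Example 11.1 can be verified numerically. The following is a minimal sketch; the helper names `phi`, `dot`, and `kernel`, as well as the two test vectors, are assumptions chosen here for illustration.

```python
import math


def phi(x):
    """Explicit feature map R^2 -> R^3 of Example 11.1."""
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2.0) * x1 * x2, x2 ** 2)


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, evaluated in the original space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))  # inner product computed after the mapping
implicit = kernel(x, y)         # kernel evaluation, no mapping required
print(explicit, implicit)
```

Both quantities agree up to rounding, although the second one never forms the three-dimensional images.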

11.5.1 SOME PROPERTIES AND THEORETICAL HIGHLIGHTS 

The reader who has no “mathematical anxieties” can bypass this subsection during a first reading.

Let $\mathcal{X}$ be a set of points. Typically $\mathcal{X}$ is a compact (closed and bounded) subset of $\mathbb{R}^l$. Consider a
function

$$\kappa: \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}.$$


4 If a space of functions has finite dimension, then it is equivalent (isometrically isomorphic) to a finite Euclidean linear/vector
space.

Definition 11.3. The function $\kappa$ is called a positive definite kernel if

$$\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m \kappa(\boldsymbol{x}_n, \boldsymbol{x}_m) \ge 0: \quad \text{positive definite kernel,} \tag{11.10}$$

for any real numbers, $a_n$, $a_m$, any points $\boldsymbol{x}_n, \boldsymbol{x}_m \in \mathcal{X}$, and any $N \in \mathbb{N}$.

Note that (11.10) can be written in an equivalent form. Define the so-called kernel matrix $\mathcal{K}$ of
order $N$,

$$\mathcal{K} := \begin{bmatrix} \kappa(\boldsymbol{x}_1, \boldsymbol{x}_1) & \cdots & \kappa(\boldsymbol{x}_1, \boldsymbol{x}_N) \\ \vdots & \ddots & \vdots \\ \kappa(\boldsymbol{x}_N, \boldsymbol{x}_1) & \cdots & \kappa(\boldsymbol{x}_N, \boldsymbol{x}_N) \end{bmatrix}. \tag{11.11}$$

Then (11.10) is written as

$$\boldsymbol{a}^T \mathcal{K} \boldsymbol{a} \ge 0, \tag{11.12}$$

where

$$\boldsymbol{a} = [a_1, \ldots, a_N]^T.$$

Because (11.10) is true for any $\boldsymbol{a} \in \mathbb{R}^N$, (11.12) suggests that for a kernel to be positive definite, it
suffices for the corresponding kernel matrix to be positive semidefinite (see footnote 5).
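Condition (11.12) can be probed numerically by building a kernel matrix and evaluating the quadratic form for many random vectors. The sketch below is an illustration under stated assumptions: it uses a Gaussian kernel of the form $\exp(-\|\boldsymbol{x} - \boldsymbol{y}\|^2 / (2\sigma^2))$, randomly drawn points, and helper names invented here. A random test of this kind can expose a violation but does not prove positive semidefiniteness.

```python
import math
import random


def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) for equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(points, kernel):
    """N x N matrix with entries kernel(x_n, x_m), as in (11.11)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(matrix, a):
    """a^T K a, the left-hand side of (11.12)."""
    size = len(a)
    return sum(a[n] * a[m] * matrix[n][m] for n in range(size) for m in range(size))


rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]
K = kernel_matrix(points, gaussian_kernel)
forms = [quadratic_form(K, [rng.gauss(0, 1) for _ in points]) for _ in range(200)]
print(min(forms) >= -1e-12)  # small tolerance for floating-point rounding
```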

Lemma 11.1. The reproducing kernel associated with an RKHS H is a positive definite kernel. 

The proof of the lemma is given in Problem 11.3. Note that the opposite is also true. It can be
shown [82,106] that if $\kappa: \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}$ is a positive definite kernel, then there exists an RKHS $\mathbb{H}$
of functions on $\mathcal{X}$ such that $\kappa(\cdot, \cdot)$ is a reproducing kernel of $\mathbb{H}$. This establishes the equivalence
between reproducing and positive definite kernels. Historically, the theory of positive definite kernels
was developed first in the context of integral equations by Mercer [76], and the connection to RKHS
was developed later on (see, for example, [3]).

Lemma 11.2. Let $\mathbb{H}$ be an RKHS on the set $\mathcal{X}$ with reproducing kernel $\kappa(\cdot, \cdot)$. Then the linear span of
the functions $\kappa(\cdot, \boldsymbol{x})$, $\boldsymbol{x} \in \mathcal{X}$, is dense in $\mathbb{H}$, that is,

$$\mathbb{H} = \overline{\operatorname{span}}\{\kappa(\cdot, \boldsymbol{x}), \ \boldsymbol{x} \in \mathcal{X}\}. \tag{11.13}$$

The proof of the lemma is given in Problem 11.4. The overbar denotes the closure of a set. In other
words, $\mathbb{H}$ can be constructed by all possible linear combinations of the kernel function computed in $\mathcal{X}$,
as well as the limit points of sequences of such combinations. Simply stated, $\mathbb{H}$ can be fully generated
from the knowledge of $\kappa(\cdot, \cdot)$.

The interested reader can obtain more theoretical results concerning RKHSs from, for example,
[64,83,87,103,106,108].


5 It may be slightly confusing that the definition of a positive definite kernel requires a positive semidefinite kernel matrix.
However, this is what has been the accepted definition.

FIGURE 11.7

(A) The Gaussian kernel for $\mathcal{X} = \mathbb{R}$, $\sigma = 0.5$. (B) The function $\kappa(\cdot, 0)$ for different values of $\sigma$.


11.5.2 EXAMPLES OF KERNEL FUNCTIONS 

In this subsection, we present some typical examples of kernel functions, which are commonly used in 
various applications. 

• The Gaussian kernel is among the most popular ones, and it is given by our familiar form

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \exp\left(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\right),$$

with $\sigma > 0$ being a parameter. Fig. 11.7A shows the Gaussian kernel as a function of $x, y \in \mathcal{X} = \mathbb{R}$
and $\sigma = 0.5$. Fig. 11.7B shows the graph of the resulting function if we set one of the kernel’s arguments
to 0, i.e., $\kappa(\cdot, 0)$, for various values of $\sigma$.
The dimension of the RKHS generated by the Gaussian kernel is infinite. A proof that the Gaussian
kernel satisfies the required properties can be obtained, for example, from [108].

• The homogeneous polynomial kernel has the form

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T\boldsymbol{y})^r,$$

where $r$ is a parameter.

FIGURE 11.8

(A) The inhomogeneous polynomial kernel for $\mathcal{X} = \mathbb{R}$, $r = 2$. (B) The graph of $\kappa(\cdot, x_0)$ for different values of $x_0$.

• The inhomogeneous polynomial kernel is given by

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^T\boldsymbol{y} + c)^r,$$

where $c \ge 0$ and $r > 0$, $r \in \mathbb{N}$, are parameters. The graph of the kernel is given in Fig. 11.8A. In
Fig. 11.8B the graphs of the function $\kappa(\cdot, x_0)$ are shown for different values of $x_0$. The dimensionality
of the RKHS associated with polynomial kernels is finite.

• The Laplacian kernel is given by

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \exp(-t\|\boldsymbol{x} - \boldsymbol{y}\|),$$

where $t > 0$ is a parameter. The dimensionality of the RKHS associated with the Laplacian kernel
is infinite.

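• The spline kernels are defined as

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = B_{2p+1}(\|\boldsymbol{x} - \boldsymbol{y}\|^2),$$

where the $B_n$ spline is defined via the $n + 1$ convolutions of the unit interval $[-\frac{1}{2}, \frac{1}{2}]$, that is,

$$B_n(\cdot) := \bigotimes_{i=1}^{n+1} \chi_{[-\frac{1}{2}, \frac{1}{2}]}(\cdot),$$

and $\chi_{[-\frac{1}{2}, \frac{1}{2}]}(\cdot)$ is the characteristic function on the respective interval (see footnote 6).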

• The sampling function or sinc kernel is of particular interest from a signal processing point of view.
This kernel function is defined as

$$\operatorname{sinc}(x) = \frac{\sin(\pi x)}{\pi x}.$$

Recall that we have met this function in Chapter 9 while discussing sub-Nyquist sampling.
Let us now consider the set of all square integrable functions, which are bandlimited, that is,

$$\mathcal{F}_B = \left\{ f: \int_{-\infty}^{+\infty} |f(x)|^2\,dx < +\infty, \ \text{and} \ |F(\omega)| = 0, \ |\omega| > \pi \right\},$$

where $F(\omega)$ is the respective Fourier transform

$$F(\omega) = \int_{-\infty}^{+\infty} f(x)e^{-j\omega x}\,dx.$$

It turns out that $\mathcal{F}_B$ is an RKHS whose reproducing kernel is the sinc function (e.g., [51]), that is,

$$\kappa(x, y) = \operatorname{sinc}(x - y).$$

This takes us back to the classical sampling theorem through the RKHS route. Without going into
details, a by-product of this view is the Shannon sampling theorem (see footnote 7); any bandlimited function can
be written as

$$f(x) = \sum_{n} f(n)\operatorname{sinc}(x - n). \tag{11.14}$$

6 It is equal to one if the variable belongs to the interval and zero otherwise.
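The kernels listed above translate directly into short functions. The following is a minimal sketch, assuming vectors are given as equal-length Python sequences and the sinc kernel acts on scalars; the function names and default parameter values are assumptions made here, and the spline kernel is omitted.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def _dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_kernel(x, y, sigma=0.5):
    """exp(-||x - y||^2 / (2 sigma^2)), sigma > 0."""
    return math.exp(-_dist(x, y) ** 2 / (2.0 * sigma ** 2))


def polynomial_kernel(x, y, r=2, c=0.0):
    """(x^T y + c)^r; c = 0 gives the homogeneous polynomial kernel."""
    return (_dot(x, y) + c) ** r


def laplacian_kernel(x, y, t=1.0):
    """exp(-t ||x - y||), t > 0."""
    return math.exp(-t * _dist(x, y))


def sinc_kernel(x, y):
    """sinc(x - y) = sin(pi (x - y)) / (pi (x - y)) for scalar arguments."""
    d = x - y
    return 1.0 if d == 0.0 else math.sin(math.pi * d) / (math.pi * d)


print(gaussian_kernel((0.0, 0.0), (0.3, 0.4)))
print(polynomial_kernel((1.0, 2.0), (3.0, -1.0), r=2, c=1.0))
print(laplacian_kernel((0.0,), (1.0,), t=2.0))
print(sinc_kernel(0.25, 0.0))
```

Each function depends on its inputs only through inner products or distances in the original space, which is what makes the kernel evaluation cheap compared with an explicit mapping.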


Constructing Kernels 

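Besides the previous examples, one can construct more kernels by applying the following properties
(Problem 11.6, [106]):

• If

$$\kappa_1(\boldsymbol{x}, \boldsymbol{y}): \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}, \qquad \kappa_2(\boldsymbol{x}, \boldsymbol{y}): \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}$$

are kernels, then

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \kappa_1(\boldsymbol{x}, \boldsymbol{y}) + \kappa_2(\boldsymbol{x}, \boldsymbol{y})$$

and

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \alpha\kappa_1(\boldsymbol{x}, \boldsymbol{y}), \quad \alpha > 0,$$

and

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \kappa_1(\boldsymbol{x}, \boldsymbol{y})\kappa_2(\boldsymbol{x}, \boldsymbol{y})$$

are also kernels.

• Let

$$f: \mathcal{X} \longmapsto \mathbb{R}.$$

Then

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = f(\boldsymbol{x})f(\boldsymbol{y})$$

is a kernel.

• Let a function

$$\boldsymbol{g}: \mathcal{X} \longmapsto \mathbb{R}^l$$

and a kernel function

$$\kappa_1(\cdot, \cdot): \mathbb{R}^l \times \mathbb{R}^l \longmapsto \mathbb{R}.$$

Then

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \kappa_1(\boldsymbol{g}(\boldsymbol{x}), \boldsymbol{g}(\boldsymbol{y}))$$

is also a kernel.

• Let $A$ be a positive definite $l \times l$ matrix. Then

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{x}^T A \boldsymbol{y}$$

is a kernel.

• If

$$\kappa_1(\boldsymbol{x}, \boldsymbol{y}): \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}$$

is a kernel, then

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = \exp(\kappa_1(\boldsymbol{x}, \boldsymbol{y}))$$

is also a kernel, and if $p(\cdot)$ is a polynomial with nonnegative coefficients,

$$\kappa(\boldsymbol{x}, \boldsymbol{y}) = p(\kappa_1(\boldsymbol{x}, \boldsymbol{y}))$$

is also a kernel.

The interested reader will find more information concerning kernels and their construction in, for example,
[52,106,108].

7 The key point behind the proof is that in $\mathcal{F}_B$, the kernel $\kappa(x, y)$ can be decomposed in terms of a set of orthogonal functions,
that is, $\operatorname{sinc}(x - n)$, $n = 0, \pm 1, \pm 2, \ldots$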
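Some of the closure properties above can be expressed as small combinators acting on kernel functions. The sketch below is illustrative only: the combinator names, the choice of the linear kernel $\boldsymbol{x}^T\boldsymbol{y}$ as the starting point, and the random test of the quadratic form in (11.12) are assumptions made here, and the random test is a sanity check rather than a proof.

```python
import math
import random


def linear_kernel(x, y):
    """x^T y, the simplest positive definite kernel on R^l."""
    return sum(a * b for a, b in zip(x, y))


def kernel_sum(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)


def kernel_scale(k1, alpha):
    return lambda x, y: alpha * k1(x, y)


def kernel_product(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)


def kernel_exp(k1):
    return lambda x, y: math.exp(k1(x, y))


def min_quadratic_form(kernel, points, rng, trials=200):
    """Smallest value of a^T K a over a number of random coefficient vectors a."""
    K = [[kernel(p, q) for q in points] for p in points]
    size = len(points)
    values = []
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(size)]
        values.append(sum(a[n] * a[m] * K[n][m] for n in range(size) for m in range(size)))
    return min(values)


rng = random.Random(2)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
candidates = {
    "sum": kernel_sum(linear_kernel, kernel_exp(linear_kernel)),
    "scaled": kernel_scale(linear_kernel, 3.0),
    "product": kernel_product(linear_kernel, linear_kernel),
    "exp": kernel_exp(linear_kernel),
}
results = {name: min_quadratic_form(k, points, rng) for name, k in candidates.items()}
print({name: value >= -1e-9 for name, value in results.items()})
```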

String Kernels 

So far, our discussion has been focused on input data that were vectors in a Euclidean space. However, 
as we have already pointed out, the input data need not necessarily be vectors, and they can be elements 
of more general sets. 

Let us denote by $\mathcal{S}$ an alphabet set; that is, a set with a finite number of elements, which we call
symbols. For example, this can be the set of all capital letters in the Latin alphabet. Bypassing the path
of formal definitions, a string is a finite sequence of any length of symbols from $\mathcal{S}$. For example, two
cases of strings are

$T_1$ = “MYNAMEISSERGIOS”, $T_2$ = “HERNAMEISDESPOINA”.

In a number of applications, such as in text mining, spam filtering, text summarization, and bioinformatics,
it is important to quantify how “similar” two strings are. Kernels, by their definition,
are similarity measures; they are constructed so as to express inner products in the high-dimensional
feature space. An inner product is a similarity measure. Two vectors are most similar if they point in
the same direction. Starting from this observation, there has been a lot of activity on defining kernels
that measure similarity between strings. Without going into details, let us give such an example.

Let us denote by $\mathcal{S}^*$ the set of all possible strings that can be constructed using symbols from $\mathcal{S}$.
Also, a string $s$ is said to be a substring of $x$ if $x = bsa$, where $a$ and $b$ are other strings (possibly
empty) from the symbols of $\mathcal{S}$. Given two strings $x, y \in \mathcal{S}^*$, define

$$\kappa(x, y) := \sum_{s \in \mathcal{S}^*} w_s \phi_s(x)\phi_s(y), \tag{11.15}$$

where $w_s \ge 0$, and $\phi_s(x)$ is the number of times substring $s$ appears in $x$. It turns out that this is indeed
a kernel, in the sense that it complies with (11.10); such kernels constructed from strings are known as
string kernels.

Obviously, a number of different variants of this kernel are available. The so-called $k$-spectrum
kernel considers common substrings only of length $k$. For example, for the two strings given before,
the value of the 6-spectrum string kernel in (11.15) is equal to one (one common substring of length
6 is identified and appears once in each one of the two strings: “NAMEIS”). The interested reader can
obtain more on this topic from, for example, [106]. We will use the notion of the string kernel in the
case study in Section 11.15.
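The 6-spectrum value quoted above can be reproduced with a few lines of code. This is a minimal sketch, assuming unit weights $w_s = 1$ for every substring of length $k$ and zero otherwise; the function names are assumptions introduced here.

```python
from collections import Counter


def spectrum_features(text: str, k: int) -> Counter:
    """phi_s(text) for every substring s of length k: number of occurrences of s."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(x: str, y: str, k: int) -> int:
    """k-spectrum kernel: sum over length-k substrings s of phi_s(x) * phi_s(y)."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    return sum(count * fy[s] for s, count in fx.items() if s in fy)


t1 = "MYNAMEISSERGIOS"
t2 = "HERNAMEISDESPOINA"
print(spectrum_kernel(t1, t2, 6))  # 1: the only common substring of length 6 is "NAMEIS"
```

Only substrings that occur in both strings contribute, so the sum over all of $\mathcal{S}^*$ in (11.15) reduces to a sum over the substrings of the shorter feature table.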


11.6 REPRESENTER THEOREM

The theorem to be stated in this section is of major importance from a practical point of view. It allows
us to perform empirical loss function optimization, based on a finite set of training samples, in a very
efficient way even if the function to be estimated belongs to a very high (even infinite)-dimensional
RKHS $\mathbb{H}$.

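Theorem 11.2. Let

$$\Omega: [0, +\infty) \longmapsto \mathbb{R}$$

be an arbitrary strictly monotonic increasing function. Let also

$$\mathcal{L}: \mathbb{R}^2 \longmapsto \mathbb{R} \cup \{\infty\}$$

be an arbitrary loss function. Then each minimizer $f \in \mathbb{H}$ of the regularized minimization task

$$\min_{f \in \mathbb{H}} J(f) := \sum_{n=1}^{N} \mathcal{L}\left(y_n, f(\boldsymbol{x}_n)\right) + \lambda\Omega\left(\|f\|^2\right) \tag{11.16}$$

admits a representation of the form

$$f(\cdot) = \sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n), \tag{11.17}$$

where $\theta_n \in \mathbb{R}$, $n = 1, 2, \ldots, N$.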


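8 The property holds also for regularization of the form $\Omega(\|f\|)$, since the quadratic function is strictly monotonic on $[0, \infty)$,
and the proof follows a similar line.

Proof. The linear span, $A := \operatorname{span}\{\kappa(\cdot, \boldsymbol{x}_1), \ldots, \kappa(\cdot, \boldsymbol{x}_N)\}$, forms a closed subspace. Then each $f \in \mathbb{H}$
can be decomposed into two parts (see (8.20)), i.e.,

$$f(\cdot) = \sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n) + f_\perp,$$

where $f_\perp$ is the part of $f$ that is orthogonal to $A$. From the reproducing property, we obtain

$$f(\boldsymbol{x}_m) = \langle f, \kappa(\cdot, \boldsymbol{x}_m) \rangle = \left\langle \sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n), \kappa(\cdot, \boldsymbol{x}_m) \right\rangle = \sum_{n=1}^{N} \theta_n \kappa(\boldsymbol{x}_m, \boldsymbol{x}_n),$$

where we used the fact that $\langle f_\perp, \kappa(\cdot, \boldsymbol{x}_n) \rangle = 0$, $n = 1, 2, \ldots, N$. In other words, the expansion in
(11.17) guarantees that at the training points, the value of $f$ does not depend on $f_\perp$. Hence, the first
term in (11.16), corresponding to the empirical loss, does not depend on $f_\perp$. Moreover, for all $f_\perp$ we
have

$$\Omega\left(\|f\|^2\right) = \Omega\left(\left\|\sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n)\right\|^2 + \|f_\perp\|^2\right) \ge \Omega\left(\left\|\sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n)\right\|^2\right).$$

Thus, for any choice of $\theta_n$, $n = 1, 2, \ldots, N$, the cost function in (11.16) is minimized for $f_\perp = 0$. Thus,
the claim is proved. □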

The theorem was first shown in [61]. In [2], the conditions under which the theorem holds were
investigated and related sufficient and necessary conditions were derived. The importance of this theorem
is that in order to optimize (11.16) with respect to $f$, one uses the expansion in (11.17) and
minimization is carried out with respect to the finite set of parameters $\theta_n$, $n = 1, 2, \ldots, N$.

Note that when working in high/infinite-dimensional spaces, the presence of a regularizer can 
hardly be avoided; otherwise, the obtained solution will suffer from overfitting, as only a finite number 
of data samples are used for training. The effect of regularization on the generalization performance 
and stability of the associated solution has been studied in a number of classical papers (for example, 
[18,39,79]). 

Usually, a bias term is added and it is assumed that the minimizing function admits the following
representation:

$$\tilde{f} = f + b, \tag{11.18}$$

$$f(\cdot) = \sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n). \tag{11.19}$$

In practice, the use of a bias term (which does not enter in the regularization) turns out to improve performance.
First, it enlarges the class of functions in which the solution is searched and potentially leads
to better performance. Moreover, due to the penalization imposed by the regularizing term, $\Omega(\|f\|^2)$,
the minimizer pushes the values which the function takes at the training points to smaller values. The
existence of $b$ tries to “absorb” some of this action (see, for example, [108]).

Remarks 11.2. 

• We will use the expansion in (11.17) in a number of cases. However, it is interesting to apply this
expansion to the RKHS of the bandlimited functions and see what comes out. Assume that the
available samples from a function $f$ are $f(n)$, $n = 1, 2, \ldots, N$ (assuming the case of normalized
sampling period $x_s = 1$). Then according to the representer theorem, we can write the following
approximation:

$$f(x) \approx \sum_{n=1}^{N} \theta_n \operatorname{sinc}(x - n). \tag{11.20}$$

Taking into account the orthonormality of the $\operatorname{sinc}(\cdot - n)$ functions, we get $\theta_n = f(n)$, $n = 1, 2, \ldots, N$.
However, note that in contrast to (11.14), which is exact, (11.20) is only an approximation.
On the other hand, (11.20) can be used even if the obtained samples are contaminated by
noise.
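The finite expansion in (11.20) can be tried out on a synthetic signal. The sketch below is a hedged illustration: the test signal `signal`, which is a shifted and stretched sinc and therefore bandlimited to $|\omega| \le \pi/2$, the number of samples, and the helper names are all assumptions made here.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def signal(x: float) -> float:
    """Hypothetical test signal, bandlimited to |omega| <= pi / 2."""
    return sinc(0.5 * (x - 4.3))


N = 8
samples = {n: signal(n) for n in range(1, N + 1)}  # f(n), n = 1, ..., N


def approximate(x: float) -> float:
    """Right-hand side of (11.20) with theta_n = f(n)."""
    return sum(f_n * sinc(x - n) for n, f_n in samples.items())


print(approximate(3.0), signal(3.0))  # agree at a sampling point
print(approximate(4.5), signal(4.5))  # only approximately equal between samples
```

At the integer sampling points the expansion reproduces the samples, because $\operatorname{sinc}(m - n)$ vanishes for $m \ne n$; between them the truncation to $N$ terms leaves a residual error, in line with the remark that (11.20) is only an approximation.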

11.6.1 SEMIPARAMETRIC REPRESENTER THEOREM 

The use of the bias term is also theoretically justified by the generalization of the representer theorem 
[103]. The essence of this theorem is to expand the solution into two parts, i.e., one that lies in an 
RKHS H, and another that is given as a linear combination of a set of preselected functions. 

Theorem 11.3. Let us assume that in addition to the assumptions adopted in Theorem 11.2, we are
given the set of real-valued functions

$$\psi_m: \mathcal{X} \longmapsto \mathbb{R}, \quad m = 1, 2, \ldots, M,$$

with the property that the $N \times M$ matrix with elements $\psi_m(\boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, $m = 1, 2, \ldots, M$,
has rank $M$. Then any

$$\tilde{f} = f + h, \quad f \in \mathbb{H}, \quad h \in \operatorname{span}\{\psi_m, \ m = 1, 2, \ldots, M\},$$

solving the minimization task

$$\min_{\tilde{f}} J(\tilde{f}) := \sum_{n=1}^{N} \mathcal{L}\left(y_n, \tilde{f}(\boldsymbol{x}_n)\right) + \Omega\left(\|f\|^2\right) \tag{11.21}$$

admits the following representation:

$$\tilde{f}(\cdot) = \sum_{n=1}^{N} \theta_n \kappa(\cdot, \boldsymbol{x}_n) + \sum_{m=1}^{M} b_m \psi_m(\cdot). \tag{11.22}$$

Obviously, the use of a bias term is a special case of the expansion above. An example of successful
application of this theorem was demonstrated in [13] in the context of image denoising. A set of
nonlinear functions in place of $\psi_m$ was used to account for the edges (nonsmooth jumps) in an image.
The part of $\tilde{f}$ lying in the RKHS accounted for the smooth parts in the image.

11.6.2 NONPARAMETRIC MODELING: A DISCUSSION 

Note that searching a model function in an RKHS space is a typical task of nonparametric modeling.
In contrast to the parametric modeling in Eq. (11.1), where the unknown function is parameterized in
terms of a set of basis functions, the minimization in (11.16) or (11.21) is performed with regard to
functions that are constrained to belong in a specific space. In the more general case, minimization
could be performed with regard to any (continuous) function, for example,

$$\min_{f} \sum_{n=1}^{N} \mathcal{L}\left(y_n, f(x_n)\right) + \lambda\phi(f),$$

where $\mathcal{L}(\cdot, \cdot)$ can be any loss function and $\phi$ an appropriately chosen regularizing functional. Note,
however, that in this case, the presence of the regularization is crucial. If there is no regularization, then
any function that interpolates the data is a solution; such techniques have also been used in interpolation
theory (for example, [78,90]). The regularization term, $\phi(f)$, helps to smooth out the function to be
recovered. To this end, functions of derivatives have been employed. For example, if the minimization
cost is chosen as

$$\sum_{n=1}^{N} \left(y_n - f(x_n)\right)^2 + \lambda\int \left(f''(x)\right)^2 dx,$$

then the solution is a cubic spline; that is, a piecewise cubic function with knots at the points $x_n$, $n =
1, 2, \ldots, N$, and it is continuously differentiable to the second order. The choice of $\lambda$ controls the
degree of smoothness of the approximating function; the larger its value, the smoother the minimizer
becomes.

If, on the other hand, $f$ is constrained to lie in an RKHS and the minimizing task is as in (11.16),
then the resulting function is of the form given in (11.17), where a kernel function is placed at each
input training point. It must be pointed out that the parametric form that now results was not in our
original intentions. It came out as a by-product of the theory. However, it should be stressed that, in
contrast to the parametric methods, now the number of parameters to be estimated is not fixed but it
depends on the number of the training points. Recall that this is an important difference and it was
carefully pointed out when parametric methods were introduced and defined in Chapter 3.


11.7 KERNEL RIDGE REGRESSION 

Ridge regression was introduced in Chapter 3 and it has also been treated in more detail in Chapter 6.
Here, we will state the task in a general RKHS. The path to be followed is the typical one used to
extend techniques which have been developed for linear models to the more general RKHS.

We assume that the generation mechanism of the data, represented by the training set $(y_n, \boldsymbol{x}_n) \in
\mathbb{R} \times \mathbb{R}^l$, is modeled via a nonlinear regression task

$$y_n = g(\boldsymbol{x}_n) + \eta_n, \quad n = 1, 2, \ldots, N. \tag{11.23}$$

Let us denote by $f$ the estimate of the unknown $g$. Sometimes, $f$ is called the hypothesis and the space
$\mathbb{H}$ in which $f$ is searched is known as the hypothesis space. We will further assume that $f$ lies in an
RKHS, associated with a kernel

$$\kappa: \mathbb{R}^l \times \mathbb{R}^l \longmapsto \mathbb{R}.$$

Motivated by the representer theorem, we adopt the following expansion:

$$f(\boldsymbol{x}) = \sum_{n=1}^{N} \theta_n \kappa(\boldsymbol{x}, \boldsymbol{x}_n).$$

According to the kernel ridge regression approach, the unknown coefficients are estimated by the following
task:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}),$$

$$J(\boldsymbol{\theta}) := \sum_{n=1}^{N} \left(y_n - \sum_{m=1}^{N} \theta_m \kappa(\boldsymbol{x}_n, \boldsymbol{x}_m)\right)^2 + C\langle f, f \rangle, \tag{11.24}$$

where $C$ is the regularization parameter (see footnote 9). Eq. (11.24) can be rewritten as (Problem 11.7)

$$J(\boldsymbol{\theta}) = (\boldsymbol{y} - \mathcal{K}\boldsymbol{\theta})^T(\boldsymbol{y} - \mathcal{K}\boldsymbol{\theta}) + C\boldsymbol{\theta}^T\mathcal{K}^T\boldsymbol{\theta}, \tag{11.25}$$

where

$$\boldsymbol{y} = [y_1, \ldots, y_N]^T, \quad \boldsymbol{\theta} = [\theta_1, \ldots, \theta_N]^T,$$

and $\mathcal{K}$ is the kernel matrix defined in (11.11); the latter is fully determined by the kernel function and
the training points. Following our familiar-by-now arguments, minimization of $J(\boldsymbol{\theta})$ with regard to $\boldsymbol{\theta}$
leads to

$$(\mathcal{K}^T\mathcal{K} + C\mathcal{K}^T)\hat{\boldsymbol{\theta}} = \mathcal{K}^T\boldsymbol{y}$$

or

$$(\mathcal{K} + CI)\hat{\boldsymbol{\theta}} = \boldsymbol{y}: \quad \text{kernel ridge regression,} \tag{11.26}$$

where $\mathcal{K}^T = \mathcal{K}$ has been assumed to be invertible (see footnote 10).

9 For the needs of this chapter, we denote the regularization constant as $C$, not to be confused with the Lagrange multipliers, to
be introduced soon.

10 This is true, for example, for the Gaussian kernel [103].

FIGURE 11.9

Plot of the data used for training together with the fitted (prediction) curve obtained via the kernel ridge regression,
for Example 11.2. The Gaussian kernel was used.

Once $\hat{\boldsymbol{\theta}}$ has been obtained, given an unknown
vector $\boldsymbol{x} \in \mathbb{R}^l$, the corresponding prediction value of the dependent variable is given by

$$\hat{y} = \sum_{n=1}^{N} \hat{\theta}_n \kappa(\boldsymbol{x}, \boldsymbol{x}_n) = \hat{\boldsymbol{\theta}}^T\boldsymbol{\kappa}(\boldsymbol{x}),$$

where

$$\boldsymbol{\kappa}(\boldsymbol{x}) = [\kappa(\boldsymbol{x}, \boldsymbol{x}_1), \ldots, \kappa(\boldsymbol{x}, \boldsymbol{x}_N)]^T.$$

Employing (11.26), we obtain

$$\hat{y}(\boldsymbol{x}) = \boldsymbol{y}^T(\mathcal{K} + CI)^{-1}\boldsymbol{\kappa}(\boldsymbol{x}). \tag{11.27}$$
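The two computational steps, solving (11.26) for the coefficients and evaluating the prediction, can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: scalar inputs, a Gaussian kernel with $\sigma = 1$, noiseless samples of a sine as toy data, no bias term, and a plain Gaussian elimination routine in place of a numerical library; all function names are introduced here.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel for scalar inputs."""
    return math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def fit(xs, ys, C, kernel):
    """Return theta solving (K + C I) theta = y, as in (11.26)."""
    size = len(xs)
    A = [[kernel(xs[i], xs[j]) + (C if i == j else 0.0) for j in range(size)]
         for i in range(size)]
    return solve(A, ys)


def predict(x, xs, theta, kernel):
    """y_hat(x) = sum_n theta_n kappa(x, x_n)."""
    return sum(t * kernel(x, xn) for t, xn in zip(theta, xs))


xs = [0.5 * n for n in range(13)]  # toy inputs in [0, 6]
ys = [math.sin(x) for x in xs]     # noiseless toy targets
C = 1e-3
theta = fit(xs, ys, C, gaussian_kernel)
print(predict(1.25, xs, theta, gaussian_kernel), math.sin(1.25))
```

At a training point $x_n$ the prediction equals $y_n - C\hat{\theta}_n$, which follows directly from (11.26) and shows how the regularization parameter trades data fit for smoothness.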


Example 11.2. In this example, the prediction power of the kernel ridge regression in the presence
of noise and outliers will be tested. The original data were samples from a music recording of Blade
Runner by Vangelis Papathanasiou. A white Gaussian noise was then added at a 15 dB level and
a number of outliers were intentionally randomly introduced and “hit” some of the values (10% of
them). The kernel ridge regression method was used, employing the Gaussian kernel with $\sigma = 0.004$.
We allowed for a bias term to be present (see Problem 11.8). The prediction (fitted) curve, $\hat{y}(x)$, for
various values of $x$ is shown in Fig. 11.9 together with the (noisy) data used for training.

FIGURE 11.10

The Huber loss function (dotted-gray), the linear $\epsilon$-insensitive (full-gray), and the quadratic $\epsilon$-insensitive (red) loss
functions, for $\epsilon = 0.7$.


11.8 SUPPORT VECTOR REGRESSION 

The squared error loss function, associated with the LS method, is not always the best criterion for
optimization, in spite of its merits. In the presence of non-Gaussian noise with long
tails (i.e., high probability for relatively large values) and, hence, with an increased number of noise
outliers, the square dependence on the error biases the squared error criterion toward values
associated with the presence of outliers. Recall from Chapter 3 that the LS method is equivalent to the
maximum likelihood estimation under the assumption of white Gaussian noise. Moreover, under this
assumption, the LS estimator achieves the Cramér-Rao bound and it becomes a minimum variance
estimator. However, under other noise scenarios, one has to look for alternative criteria.

The task of optimization in the presence of outliers was studied by Huber [54], whose goal was to 
obtain a strategy for choosing the loss function that “matches” the noise model best. He proved that 
under the assumption that the noise has a symmetric probability density function (PDF), the optimal 
minimax strategy for regression is obtained via the following loss function: 

$$\mathcal{L}(y, f(x)) = |y - f(x)|,$$

which gives rise to the least-modulus method. Note from Section 5.8 that the stochastic gradient online
version for this loss function leads to the sign-error LMS. Huber also showed that if the noise comprises
two components, one corresponding to a Gaussian and another to an arbitrary PDF (which remains
symmetric), then the best in the minimax sense loss function is given by

$$\mathcal{L}(y, f(x)) = \begin{cases} \epsilon |y - f(x)| - \dfrac{\epsilon^2}{2}, & \text{if } |y - f(x)| > \epsilon, \\ \dfrac{1}{2} |y - f(x)|^2, & \text{if } |y - f(x)| \le \epsilon, \end{cases}$$

for some parameter $\epsilon$. This is known as the Huber loss function, and it is shown in Fig. 11.10.

11 The best $L_2$ approximation of the worst noise model. Note that this is a pessimistic scenario, because the only information
that is used is the symmetry of the noise PDF.

A loss function that can approximate the Huber one and, as we will see, turns out to have some nice
computational properties, is the so-called linear $\epsilon$-insensitive loss function, defined as (see also Chapter 8)


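$$\mathcal{L}(y, f(x)) = \begin{cases} |y - f(x)| - \epsilon, & \text{if } |y - f(x)| > \epsilon, \\ 0, & \text{if } |y - f(x)| \le \epsilon, \end{cases} \tag{11.28}$$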


shown in Fig. 11.10. Note that for $\epsilon = 0$, it coincides with the absolute value loss function, and it is
close to the Huber loss for small values of $\epsilon < 1$. Another version is the quadratic $\epsilon$-insensitive, defined
as


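$$\mathcal{L}(y, f(x)) = \begin{cases} |y - f(x)|^2 - \epsilon, & \text{if } |y - f(x)| > \epsilon, \\ 0, & \text{if } |y - f(x)| \le \epsilon, \end{cases} \tag{11.29}$$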


which coincides with the squared error loss for $\epsilon = 0$. The corresponding graph is given in Fig. 11.10.
Observe that the two previously discussed $\epsilon$-insensitive loss functions retain their convex nature; however,
they are no longer differentiable at all points.
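
As an illustration, the three loss functions discussed above can be written down directly from their definitions. The following is a minimal sketch in plain Python; the function names are hypothetical choices made for this example, and scalar outputs are assumed for simplicity.

```python
def huber_loss(y, f, eps):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(y - f)
    if r > eps:
        return eps * r - eps ** 2 / 2.0
    return 0.5 * r ** 2


def linear_eps_insensitive(y, f, eps):
    """Linear eps-insensitive loss, as in (11.28)."""
    return max(0.0, abs(y - f) - eps)


def quadratic_eps_insensitive(y, f, eps):
    """Quadratic eps-insensitive loss, as in (11.29)."""
    r = abs(y - f)
    return r ** 2 - eps if r > eps else 0.0
```

Setting `eps = 0` in the last two functions recovers the absolute value and the squared error losses, respectively, in line with the remarks above.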


11.8.1 THE LINEAR $\epsilon$-INSENSITIVE OPTIMAL REGRESSION

Let us now adopt (11.28) as the loss function to quantify model misfit. We will treat the regression task
in (11.23), employing a linear model for $f$, that is,

$$f(x) = \theta^T x + \theta_0.$$

Once we obtain the solution expressed in inner product operations, the more general solution for the
case where $f$ lies in an RKHS will be obtained via the kernel trick; that is, inner products will be
replaced by kernel evaluations.

Let us now introduce two sets of auxiliary variables. If

$$y_n - \theta^T x_n - \theta_0 \ge \epsilon,$$

define $\tilde{\xi}_n \ge 0$, such that

$$y_n - \theta^T x_n - \theta_0 \le \epsilon + \tilde{\xi}_n.$$

Note that ideally, we would like to select $\theta$, $\theta_0$ so that $\tilde{\xi}_n = 0$, because this would make the contribution
of the respective term in the loss function equal to zero. Also, if

$$y_n - \theta^T x_n - \theta_0 \le -\epsilon,$$

define $\xi_n \ge 0$, such that

$$\theta^T x_n + \theta_0 - y_n \le \epsilon + \xi_n.$$

Once more, we would like to select our unknown set of parameters so that $\xi_n$ is zero.
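We are now ready to formulate the minimizing task around the corresponding empirical cost, regularized
by the norm of $\theta$, which is cast in terms of the auxiliary variables as$^{12}$

$$\text{minimize} \quad J(\theta, \theta_0, \xi, \tilde{\xi}) = \frac{1}{2} \|\theta\|^2 + C \left( \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \tilde{\xi}_n \right), \tag{11.30}$$

$$\text{subject to} \quad y_n - \theta^T x_n - \theta_0 \le \epsilon + \tilde{\xi}_n, \quad n = 1, 2, \ldots, N, \tag{11.31}$$

$$\theta^T x_n + \theta_0 - y_n \le \epsilon + \xi_n, \quad n = 1, 2, \ldots, N, \tag{11.32}$$

$$\tilde{\xi}_n \ge 0, \quad \xi_n \ge 0, \quad n = 1, 2, \ldots, N. \tag{11.33}$$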

Before we proceed further some explanations are in order. 

• The auxiliary variables $\tilde{\xi}_n$ and $\xi_n$, $n = 1, 2, \ldots, N$, which measure the excess error with regard to
$\epsilon$, are known as slack variables. Note that according to the $\epsilon$-insensitive rationale, any contribution
to the cost function of an error with absolute value less than or equal to $\epsilon$ is zero. The previous
optimization task attempts to estimate $\theta$, $\theta_0$ so that the contribution of error values larger than $\epsilon$
and smaller than $-\epsilon$ is minimized. Thus, the optimization task in (11.30)–(11.33) is equivalent to
minimizing the empirical loss function

$$\frac{1}{2} \|\theta\|^2 + C \sum_{n=1}^{N} \mathcal{L}\left(y_n, \theta^T x_n + \theta_0\right),$$

where the loss function is the linear $\epsilon$-insensitive one. Note that any other method for minimizing
(nondifferentiable) convex functions could be used (for example, Chapter 8). However, the constrained
optimization involving the slack variables has a historical value and it was the path that
paved the way in employing the kernel trick, as we will see soon.
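
The equivalence between the slack-variable formulation and the $\epsilon$-insensitive loss can be checked numerically for a single sample. The sketch below is only an illustration under simple assumptions (scalar output, fixed prediction `f`); the helper names `minimal_slacks` and `linear_eps_insensitive` are hypothetical. For a fixed prediction, the smallest slacks that satisfy the constraints in (11.31)–(11.33) sum to the value of the loss in (11.28).

```python
def minimal_slacks(y, f, eps):
    """Smallest nonnegative slacks satisfying the two inequality
    constraints for one sample, given the prediction f."""
    e = y - f
    xi_tilde = max(0.0, e - eps)   # needed when y - f exceeds +eps
    xi = max(0.0, -e - eps)        # needed when y - f falls below -eps
    return xi_tilde, xi


def linear_eps_insensitive(y, f, eps):
    """Linear eps-insensitive loss of a single residual."""
    return max(0.0, abs(y - f) - eps)


if __name__ == "__main__":
    for y, f in [(1.0, 0.0), (0.0, 1.0), (0.1, 0.0)]:
        xt, x = minimal_slacks(y, f, 0.25)
        # At most one slack is nonzero, and their sum equals the loss.
        print(y, f, xt, x, linear_eps_insensitive(y, f, 0.25))
```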

The Solution 

The solution of the optimization task is obtained by introducing Lagrange multipliers and forming 
the corresponding Lagrangian (see below for the detailed derivation). Having obtained the Lagrange 
multipliers, the solution turns out to be given in a simple and rather elegant form, 

$$\hat{\theta} = \sum_{n=1}^{N} (\tilde{\lambda}_n - \lambda_n) x_n,$$

where $\tilde{\lambda}_n$, $\lambda_n$, $n = 1, 2, \ldots, N$, are the Lagrange multipliers associated with each one of the constraints
in (11.31) and (11.32), respectively. It turns out that the Lagrange multipliers are nonzero only for those points $x_n$
that correspond to error values either equal to or larger than $\epsilon$. These are known as support vectors.
Points that score error values less than $\epsilon$ correspond to zero Lagrange multipliers and do not participate
in the formation of the solution. The bias term can be obtained from any one of the equations

$$y_n - \hat{\theta}^T x_n - \hat{\theta}_0 = \epsilon, \tag{11.34}$$

$$\hat{\theta}^T x_n + \hat{\theta}_0 - y_n = \epsilon, \tag{11.35}$$

where $n$ above runs over the points that are associated with $\tilde{\lambda}_n > 0$ ($\lambda_n > 0$) and $\tilde{\xi}_n = 0$ ($\xi_n = 0$) (note
that these points form a subset of the support vectors). In practice, $\hat{\theta}_0$ is obtained as the average from
all the previous equations.

12 It is common in the literature to formulate the regularized cost via the parameter $C$ multiplying the loss term and not $\|\theta\|^2$.
In any case, they are both equivalent.

For the more general setting of solving a linear task in an RKHS, the solution is of the same form
as the one obtained before. All one needs to do is to replace $x_n$ with the respective images, $\kappa(\cdot, x_n)$,
and the vector $\hat{\theta}$ by a function, $\hat{\theta}(\cdot)$, i.e.,

$$\hat{\theta}(\cdot) = \sum_{n=1}^{N} (\tilde{\lambda}_n - \lambda_n) \kappa(\cdot, x_n).$$

Once $\hat{\theta}$, $\hat{\theta}_0$ have been obtained, we are ready to perform prediction. Given a value $x$, we first perform
the (implicit) mapping using the feature map

$$x \longmapsto \kappa(\cdot, x),$$

and we get

$$\hat{y}(x) = \langle \hat{\theta}, \kappa(\cdot, x) \rangle + \hat{\theta}_0,$$

or

$$\hat{y}(x) = \sum_{n=1}^{N_s} (\tilde{\lambda}_n - \lambda_n) \kappa(x, x_n) + \hat{\theta}_0: \quad \text{SVR prediction}, \tag{11.36}$$

where $N_s \le N$ is the number of nonzero Lagrange multipliers. Observe that (11.36) is an expansion
in terms of nonlinear (kernel) functions. Moreover, as only a fraction of the points is involved ($N_s$),
the use of the $\epsilon$-insensitive loss function achieves a form of sparsification on the general expansion
dictated by the representer theorem in (11.17).
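
To make the prediction rule concrete, here is a minimal sketch of the kernel expansion in (11.36) for scalar inputs. It assumes that the differences of the Lagrange multipliers and the bias have already been computed by some solver; the names `gaussian_kernel`, `svr_predict`, `coef`, and `theta0`, as well as the particular Gaussian kernel parameterization $\exp(-(x - z)^2 / (2\sigma^2))$, are assumptions made for the illustration.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel for scalar inputs (assumed parameterization)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def svr_predict(x, support_x, coef, theta0, sigma):
    """Kernel expansion over the support vectors only.

    coef[n] holds the difference of the two multipliers of the n-th
    support vector; points with zero multipliers are simply omitted.
    """
    return sum(c * gaussian_kernel(x, xn, sigma)
               for c, xn in zip(coef, support_x)) + theta0
```

Only the support vectors enter the sum, so the cost of one prediction is proportional to $N_s$ kernel evaluations rather than $N$.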


Solving the Optimization Task 

The reader who is not interested in proofs can bypass this part in a first reading. 

The task in (11.30)–(11.33) is a convex programming minimization, with a set of linear inequality
constraints. As discussed in Appendix C, a minimizer has to satisfy the following Karush-Kuhn-Tucker
(KKT) conditions:


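$$\frac{\partial L}{\partial \theta} = 0, \quad \frac{\partial L}{\partial \theta_0} = 0, \quad \frac{\partial L}{\partial \tilde{\xi}_n} = 0, \quad \frac{\partial L}{\partial \xi_n} = 0, \tag{11.37}$$

$$\tilde{\lambda}_n \left( y_n - \theta^T x_n - \theta_0 - \epsilon - \tilde{\xi}_n \right) = 0, \quad n = 1, 2, \ldots, N, \tag{11.38}$$

$$\lambda_n \left( \theta^T x_n + \theta_0 - y_n - \epsilon - \xi_n \right) = 0, \quad n = 1, 2, \ldots, N, \tag{11.39}$$

$$\tilde{\mu}_n \tilde{\xi}_n = 0, \quad \mu_n \xi_n = 0, \quad n = 1, 2, \ldots, N, \tag{11.40}$$

$$\tilde{\lambda}_n \ge 0, \quad \lambda_n \ge 0, \quad \tilde{\mu}_n \ge 0, \quad \mu_n \ge 0, \quad n = 1, 2, \ldots, N, \tag{11.41}$$

where $L$ is the respective Lagrangian,

$$\begin{aligned} L(\theta, \theta_0, \tilde{\xi}, \xi, \lambda, \mu) = {} & \frac{1}{2} \|\theta\|^2 + C \left( \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \tilde{\xi}_n \right) \\ & + \sum_{n=1}^{N} \tilde{\lambda}_n \left( y_n - \theta^T x_n - \theta_0 - \epsilon - \tilde{\xi}_n \right) \\ & + \sum_{n=1}^{N} \lambda_n \left( \theta^T x_n + \theta_0 - y_n - \epsilon - \xi_n \right) \\ & - \sum_{n=1}^{N} \tilde{\mu}_n \tilde{\xi}_n - \sum_{n=1}^{N} \mu_n \xi_n, \end{aligned} \tag{11.42}$$

where $\tilde{\lambda}_n$, $\lambda_n$, $\tilde{\mu}_n$, $\mu_n$ are the corresponding Lagrange multipliers. A close observation of (11.38) and
(11.39) reveals that (why?)

$$\tilde{\xi}_n \xi_n = 0, \quad \tilde{\lambda}_n \lambda_n = 0, \quad n = 1, 2, \ldots, N. \tag{11.43}$$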

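Taking the derivatives of the Lagrangian in (11.37) and equating to zero results in

$$\frac{\partial L}{\partial \theta} = 0 \longrightarrow \hat{\theta} = \sum_{n=1}^{N} (\tilde{\lambda}_n - \lambda_n) x_n, \tag{11.44}$$

$$\frac{\partial L}{\partial \theta_0} = 0 \longrightarrow \sum_{n=1}^{N} \tilde{\lambda}_n = \sum_{n=1}^{N} \lambda_n, \tag{11.45}$$

$$\frac{\partial L}{\partial \tilde{\xi}_n} = 0 \longrightarrow C - \tilde{\lambda}_n - \tilde{\mu}_n = 0, \tag{11.46}$$

$$\frac{\partial L}{\partial \xi_n} = 0 \longrightarrow C - \lambda_n - \mu_n = 0. \tag{11.47}$$

Note that all one needs in order to obtain $\hat{\theta}$ are the values of the Lagrange multipliers. As discussed in
Appendix C, these can be obtained by writing the problem in its dual representation form, that is,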


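$$\text{maximize with respect to } \lambda, \tilde{\lambda} \quad \sum_{n=1}^{N} (\tilde{\lambda}_n - \lambda_n) y_n - \epsilon \sum_{n=1}^{N} (\tilde{\lambda}_n + \lambda_n) - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (\tilde{\lambda}_n - \lambda_n)(\tilde{\lambda}_m - \lambda_m) x_n^T x_m, \tag{11.48}$$

$$\text{subject to} \quad 0 \le \tilde{\lambda}_n \le C, \quad 0 \le \lambda_n \le C, \quad n = 1, 2, \ldots, N, \tag{11.49}$$

$$\sum_{n=1}^{N} \tilde{\lambda}_n = \sum_{n=1}^{N} \lambda_n. \tag{11.50}$$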

Concerning the maximization task in (11.48)–(11.50), the following comments are in order.

• Plugging into the Lagrangian the estimate obtained in (11.44) and following the steps as required
by the dual representation form (Problem 11.10), (11.48) results.

• From (11.46) and (11.47), taking into account that $\tilde{\mu}_n \ge 0$, $\mu_n \ge 0$, (11.49) results.

• The beauty of the dual representation form is that it involves the observation vectors in the form of
inner product operations. Thus, when the task is solved in an RKHS, (11.48) becomes

$$\text{maximize with respect to } \lambda, \tilde{\lambda} \quad \sum_{n=1}^{N} (\tilde{\lambda}_n - \lambda_n) y_n - \epsilon \sum_{n=1}^{N} (\tilde{\lambda}_n + \lambda_n) - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (\tilde{\lambda}_n - \lambda_n)(\tilde{\lambda}_m - \lambda_m) \kappa(x_n, x_m).$$


• The KKT conditions convey important information. The Lagrange multipliers, $\tilde{\lambda}_n$, $\lambda_n$, for points
that score error less than $\epsilon$, that is,

$$|\theta^T x_n + \theta_0 - y_n| < \epsilon,$$

are zero. This is a direct consequence of (11.38) and (11.39) and the fact that $\tilde{\xi}_n, \xi_n \ge 0$. Thus, the
Lagrange multipliers are nonzero only for points which score error either equal to $\epsilon$ ($\tilde{\xi}_n, \xi_n = 0$)
or larger values ($\tilde{\xi}_n, \xi_n > 0$). In other words, only the points with nonzero Lagrange multipliers
(support vectors) enter in (11.44), which leads to a sparsification of the expansion in (11.44).

• Due to (11.43), either $\tilde{\xi}_n$ or $\xi_n$ can be nonzero, but not both of them. This also applies to the
corresponding Lagrange multipliers.

• Note that if $\tilde{\xi}_n > 0$ (or $\xi_n > 0$), then from (11.40), (11.46), and (11.47) we obtain

$$\tilde{\lambda}_n = C \quad \text{or} \quad \lambda_n = C.$$

That is, the respective Lagrange multipliers get their maximum value. In other words, they have a
“big say” in the expansion in (11.44). When $\tilde{\xi}_n$ and/or $\xi_n$ are zero, then

$$0 \le \tilde{\lambda}_n \le C, \quad 0 \le \lambda_n \le C.$$

• Recall what we have said before concerning the estimation of $\hat{\theta}_0$. Select any point corresponding to
$0 < \tilde{\lambda}_n < C$ ($0 < \lambda_n < C$), which we know corresponds to $\tilde{\xi}_n = 0$ ($\xi_n = 0$). Then $\hat{\theta}_0$ is computed from
(11.38) and (11.39). In practice, one selects all such points and computes $\hat{\theta}_0$ as the respective mean.

• Fig. 11.11 illustrates $\hat{y}(x)$ for a choice of $\kappa(\cdot, \cdot)$. Observe that the value of $\epsilon$ forms a “tube” around
the respective graph. Points lying outside the tube correspond to values of the slack variables larger
than zero.
FIGURE 11.11

The tube around the nonlinear regression curve. Points outside the tube (denoted by stars) have either $\tilde{\xi} > 0$ and
$\xi = 0$ or $\xi > 0$ and $\tilde{\xi} = 0$. The rest of the points have $\tilde{\xi} = \xi = 0$. Points that are inside the tube correspond to zero
Lagrange multipliers.


Remarks 11.3. 

• Besides the linear $\epsilon$-insensitive loss, similar analysis is valid for the quadratic $\epsilon$-insensitive and
Huber loss functions (for example, [28]). It turns out that use of the Huber loss function results in a
larger number of support vectors. Note that a large number of support vectors increases complexity,
as more kernel evaluations are involved.

• Sparsity and $\epsilon$-insensitive loss function: Note that (11.36) has exactly the same form as (11.18).
However, in the former case, the expansion is a sparse one using $N_s < N$, and in practice often
$N_s \ll N$. The obvious question that is now raised is whether there is a “hidden” connection between
the $\epsilon$-insensitive loss function and the sparsity promoting methods, discussed in Chapter 9.
Interestingly enough, the answer is in the affirmative [47]. Assuming the unknown function, $g$, in
(11.23) to reside in an RKHS, and exploiting the representer theorem, it is approximated by an
expansion in an RKHS and the unknown parameters are estimated by minimizing

$$L(\theta) = \frac{1}{2} \left\| g(\cdot) - \sum_{n=1}^{N} \theta_n \kappa(\cdot, x_n) \right\|_{\mathbb{H}}^2 + \epsilon \sum_{n=1}^{N} |\theta_n|.$$

This is similar to what we did for the kernel ridge regression with the notable exception that the
$\ell_1$ norm of the parameters is involved for regularization. The norm $\| \cdot \|_{\mathbb{H}}$ denotes the norm
associated with the RKHS. Elaborating on the norm, it can be shown that for the noiseless case, the
minimization task becomes identical with the SVR one.

Example 11.3. Consider the same time series used for the nonlinear prediction task in Example 11.2.
This time, the SVR method was used, optimized around the linear $\epsilon$-insensitive loss function, with
$\epsilon = 0.003$. The same Gaussian kernel with $\sigma = 0.004$ was employed as in the kernel ridge regression
case. Fig. 11.12 shows the resulting prediction curve, $\hat{y}(x)$, as a function of $x$ given in (11.36). The
encircled points are the support vectors. Even without the use of any quantitative measure, the resulting
curve fits the data samples much better compared to the kernel ridge regression, exhibiting the enhanced
robustness of the SVR method relative to the kernel ridge regression, in the presence of outliers.
FIGURE 11.12

The resulting prediction curve for the same data points as those used for Example 11.2. The improved performance
compared to the kernel ridge regression used for Fig. 11.9 is readily observed. The encircled points are the support
vectors resulting from the optimization, using the $\epsilon$-insensitive loss function.


Remarks 11.4. 

• A more recent trend to deal with outliers is via their explicit modeling. The noise is split into two
components, the inlier and the outlier. The outlier part has to be sparse; otherwise, it would not
be called outlier. Then, sparsity-related arguments are mobilized to solve an optimization task that
estimates both the parameters as well as the outliers (see, for example, [15,71,80,85,86]).


11.9 KERNEL RIDGE REGRESSION REVISITED 

The kernel ridge regression was introduced in Section 11.7. Here, it will be restated via its dual
representation form. The ridge regression in its primal representation can be cast as

$$\text{minimize with respect to } \theta, \xi \quad J(\theta, \xi) = \sum_{n=1}^{N} \xi_n^2 + C \|\theta\|^2, \tag{11.51}$$

$$\text{subject to} \quad y_n - \theta^T x_n = \xi_n, \quad n = 1, 2, \ldots, N,$$

which leads to the following Lagrangian:

$$L(\theta, \xi, \lambda) = \sum_{n=1}^{N} \xi_n^2 + C \|\theta\|^2 + \sum_{n=1}^{N} \lambda_n \left( y_n - \theta^T x_n - \xi_n \right). \tag{11.52}$$
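Differentiating with respect to $\theta$ and $\xi_n$, $n = 1, 2, \ldots, N$, and equating to zero, we obtain

$$\theta = \frac{1}{2C} \sum_{n=1}^{N} \lambda_n x_n \tag{11.53}$$

and

$$\xi_n = \frac{\lambda_n}{2}, \quad n = 1, 2, \ldots, N. \tag{11.54}$$

To obtain the Lagrange multipliers, (11.53) and (11.54) are substituted in (11.52), which results in the
dual formulation of the problem, that is,

$$\text{maximize with respect to } \lambda \quad \sum_{n=1}^{N} \lambda_n y_n - \frac{1}{4C} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m \kappa(x_n, x_m) - \frac{1}{4} \sum_{n=1}^{N} \lambda_n^2, \tag{11.55}$$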

where we have replaced $x_n^T x_m$ with the kernel operation according to the kernel trick. It is a matter of
straightforward algebra to obtain (see [98] and Problem 11.9)

$$\lambda = 2C (\mathcal{K} + CI)^{-1} y, \tag{11.56}$$

which, combined with (11.53) and the kernel trick, gives the prediction rule for the kernel
ridge regression, that is,

$$\hat{y}(x) = y^T (\mathcal{K} + CI)^{-1} \kappa(x), \tag{11.57}$$

which is the same as (11.27); however, via this path one need not assume invertibility of $\mathcal{K}$. An
efficient scheme for solving the kernel ridge regression has been developed in [118,119].
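
A small numerical sketch may help to see how (11.56) and (11.57) are used. The code below is an illustration only, written in plain Python for scalar inputs; the Gaussian kernel parameterization and all function names (`gaussian_kernel`, `solve`, `krr_fit`, `krr_predict`) are assumptions made for this example, and the naive Gaussian elimination is meant for tiny problems, not as an efficient solver.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel for scalar inputs (assumed parameterization)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def krr_fit(xs, ys, C, sigma):
    """Return alpha = (K + C I)^{-1} y; the multipliers are 2 C alpha."""
    n = len(xs)
    A = [[gaussian_kernel(xs[i], xs[j], sigma) + (C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(A, ys)


def krr_predict(x, xs, alpha, sigma):
    """Prediction y^T (K + C I)^{-1} kappa(x), written as alpha^T kappa(x)."""
    return sum(a * gaussian_kernel(x, xn, sigma) for a, xn in zip(alpha, xs))
```

Because $\mathcal{K} + CI$ is symmetric, $y^T (\mathcal{K} + CI)^{-1} \kappa(x)$ equals the inner product of `alpha` with the vector of kernel evaluations, which is what `krr_predict` computes.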


11.10 OPTIMAL MARGIN CLASSIFICATION: SUPPORT VECTOR MACHINES 


The optimal classifier, in the sense of minimizing the classification error, is the Bayesian classifier as
discussed in Chapter 7. The method, being a member of the generative learning family, requires the
knowledge of the underlying probability distributions. If these are not known, an alternative path is
to resort to discriminative learning techniques and adopt a discriminant function $f$ that realizes the
corresponding classifier and try to optimize it so as to minimize the respective empirical loss, that is,

$$J(f) = \sum_{n=1}^{N} \mathcal{L}(y_n, f(x_n)),$$

where

$$y_n = \begin{cases} +1, & \text{if } x_n \in \omega_1, \\ -1, & \text{if } x_n \in \omega_2. \end{cases}$$




For a binary classification task, the first loss function that comes to mind is

$$\mathcal{L}(y, f(x)) = \begin{cases} 1, & \text{if } y f(x) \le 0, \\ 0, & \text{otherwise}, \end{cases} \tag{11.58}$$

which is also known as the (0, 1)-loss function. However, this is a discontinuous function and its
optimization is a hard task. To this end, a number of alternative loss functions have been adopted in an
effort to approximate the (0, 1)-loss function. Recall that the squared error loss can also be employed
but, as already pointed out in Chapters 3 and 7, this is not well suited for classification tasks and bears
little resemblance to the (0, 1)-loss function. In this section, we turn our attention to the so-called hinge
loss function defined as (Chapter 8)

$$\mathcal{L}_\rho(y, f(x)) = \max\{0, \rho - y f(x)\}. \tag{11.59}$$

In other words, if the product between the true label ($y$) and the value predicted by the discriminant
function ($f(x)$) is positive and larger than a threshold/margin (user-defined) value $\rho > 0$,
the loss is zero. If not, the loss exhibits a linear increase. We say that a margin error is committed if
$y f(x)$ cannot achieve a value of at least $\rho$. The hinge loss function is shown in Fig. 11.13, together
with (0, 1) and squared error loss functions.
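
The difference between a classification error and a margin error can be seen by evaluating the two losses side by side. This is a minimal sketch with hypothetical function names; treating the boundary case $y f(x) = 0$ as an error in the (0, 1)-loss is an assumption of this example.

```python
def zero_one_loss(y, f):
    """(0, 1)-loss: 1 if the sign of f disagrees with the label y.
    The boundary case y * f == 0 is counted as an error (assumption)."""
    return 1.0 if y * f <= 0 else 0.0


def hinge_loss(y, f, rho=1.0):
    """Hinge loss with margin rho: zero only if y * f >= rho."""
    return max(0.0, rho - y * f)


if __name__ == "__main__":
    # A correctly classified point that still commits a margin error:
    print(zero_one_loss(+1, 0.5), hinge_loss(+1, 0.5))
    # A misclassified point: both losses are positive.
    print(zero_one_loss(+1, -0.5), hinge_loss(+1, -0.5))
```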

We will constrain ourselves to linear discriminant functions, residing in some RKHS, of the form

$$f(x) = \theta_0 + \langle \theta, \phi(x) \rangle,$$

where, by definition,

$$\phi(x) := \kappa(\cdot, x)$$

is the feature map. However, for the same reasons discussed in Section 11.8.1, we will cast the task
as a linear one in the input space, $\mathbb{R}^l$, and at the final stage the kernel information will be “implanted”
using the kernel trick.

FIGURE 11.13

The (0, 1)-loss (dotted red), the hinge loss (full red), and the squared error (dotted black) functions tuned to pass
through the (0, 1) point for comparison. For the hinge loss, $\rho = 1$. Also, for the hinge and (0, 1)-loss functions
$\tau = y f(x)$, and for the squared error one, $\tau = y - f(x)$.

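The goal of designing a linear classifier now becomes equivalent to minimizing the cost

$$J(\theta, \theta_0) = \frac{1}{2} \|\theta\|^2 + C \sum_{n=1}^{N} \mathcal{L}_\rho\left(y_n, \theta^T x_n + \theta_0\right). \tag{11.60}$$

Alternatively, employing slack variables, and following a similar reasoning as in Section 11.8.1,
minimizing (11.60) becomes equivalent to

$$\text{minimize with respect to } \theta, \theta_0, \xi \quad J(\theta, \xi) = \frac{1}{2} \|\theta\|^2 + C \sum_{n=1}^{N} \xi_n, \tag{11.61}$$

$$\text{subject to} \quad y_n (\theta^T x_n + \theta_0) \ge \rho - \xi_n, \tag{11.62}$$

$$\xi_n \ge 0, \quad n = 1, 2, \ldots, N. \tag{11.63}$$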


From now on, we will adopt the value $\rho = 1$, without harming generality. A margin error is committed
if $y_n(\theta^T x_n + \theta_0) < 1$, corresponding to $\xi_n > 0$; otherwise the inequality in (11.62) is not satisfied. For
the latter to be satisfied, in case of a margin error, it is necessary that $\xi_n > 0$ and this contributes to the
cost in Eq. (11.61). On the other hand, if $\xi_n = 0$, then $y_n(\theta^T x_n + \theta_0) \ge 1$ and there is no contribution
in the cost function. Observe that the smaller the value of $y_n(\theta^T x_n + \theta_0)$ is, with respect to the margin
$\rho = 1$, the larger the corresponding $\xi_n$ should be for the inequality to be satisfied and, hence, the larger
the contribution in the cost in Eq. (11.61) is. Thus, the goal of the optimization task is to drive as many
of the $\xi_n$s to zero as possible. The optimization task in (11.61)–(11.63) has an interesting and important
geometric interpretation.


11.10.1 LINEARLY SEPARABLE CLASSES: MAXIMUM MARGIN CLASSIFIERS 

Assuming linearly separable classes, there is an infinity of linear classifiers that solve the classification
task exactly, without committing errors on the training set (see Fig. 11.14). It is easy to see, and it will
become apparent very soon, that from this infinity of hyperplanes that solve the task, we can always
identify a subset such that

$$y_n(\theta^T x_n + \theta_0) \ge 1, \quad n = 1, 2, \ldots, N,$$

which guarantees that $\xi_n = 0$, $n = 1, 2, \ldots, N$, in (11.61)–(11.63). Hence, for linearly separable
classes, the previous optimization task is equivalent to

$$\text{minimize with respect to } \theta \quad \frac{1}{2} \|\theta\|^2, \tag{11.64}$$

$$\text{subject to} \quad y_n(\theta^T x_n + \theta_0) \ge 1, \quad n = 1, 2, \ldots, N. \tag{11.65}$$

In other words, from this infinity of linear classifiers, which can solve the task and classify correctly all
training patterns, our optimization task selects the one that has minimum norm. As will be explained
next, the norm $\|\theta\|$ is directly related to the margin formed by the respective classifier.
FIGURE 11.14

There is an infinite number of linear classifiers that can classify correctly all the patterns in a linearly separable class
task.

FIGURE 11.15

The direction of the hyperplane $\theta^T x + \theta_0 = 0$ is determined by $\theta$ and its position in space by $\theta_0$. Note that $\theta$ has
been shifted away from the origin, since here we are only interested in indicating its direction, which is perpendicular
to the straight line.


Each hyperplane in space is described by the equation

$$f(x) = \theta^T x + \theta_0 = 0. \tag{11.66}$$

From classical geometry (see also Problem 5.12), we know that its direction in space is controlled
by $\theta$ (which is perpendicular to the hyperplane) and its position is controlled by $\theta_0$ (see Fig. 11.15).
From the set of all hyperplanes that solve the task exactly and have certain direction (i.e., they share
a common $\theta$), we select $\theta_0$ so as to place the hyperplane in between the two classes, such that its
distance from the nearest points from each one of the two classes is the same. Fig. 11.16 shows the
linear classifiers (hyperplanes) in two different directions (full lines in gray and red). Both of them
have been placed so as to have the same distance from the nearest points in both classes. Moreover,
note that the distance $z_1$ associated with the “gray” classifier is smaller than the $z_2$ associated with the
“red” one.

FIGURE 11.16

For each direction $\theta$, “red” and “gray,” the (linear) hyperplane classifier, $\theta^T x + \theta_0 = 0$ (full lines), is placed in
between the two classes and normalized so that the nearest points from each class have a distance equal to one.
The dotted lines, $\theta^T x + \theta_0 = \pm 1$, which pass through the nearest points, are parallel to the respective classifier and
define the margin. The width of the margin is determined by the direction of the corresponding classifier in space
and is equal to $\frac{2}{\|\theta\|}$.

From basic geometry, we know that the distance of a point $x$ from a hyperplane (see Fig. 11.15) is
given by

$$z = \frac{|\theta^T x + \theta_0|}{\|\theta\|},$$

which is obviously zero if the point lies on the hyperplane. Moreover, we can always scale by a constant
factor, say, $a$, both $\theta$ and $\theta_0$ without affecting the geometry of the hyperplane, as described by
Eq. (11.66). After an appropriate scaling, we can always make the distance of the nearest points from
the two classes to the hyperplane equal to $z = \frac{1}{\|\theta\|}$; equivalently, the scaling guarantees that $f(x) = \pm 1$
if $x$ is a nearest to the hyperplane point and depending on whether the point belongs to $\omega_1$ ($+1$) or
$\omega_2$ ($-1$). The two hyperplanes, defined by $f(x) = \pm 1$, are shown in Fig. 11.16 as dotted lines, for both
the “gray” and the “red” directions. The pair of these hyperplanes defines the corresponding margin,
for each direction, whose width is equal to $\frac{2}{\|\theta\|}$.

Thus, any classifier that is constructed as explained before, and which solves the task, has the
following two properties:

• it has a margin of width equal to $\frac{1}{\|\theta\|} + \frac{1}{\|\theta\|}$;

• it satisfies the two sets of constraints,

$$\theta^T x_n + \theta_0 \ge +1, \quad x_n \in \omega_1,$$

$$\theta^T x_n + \theta_0 \le -1, \quad x_n \in \omega_2.$$

Hence, the optimization task in (11.64)–(11.65) computes the linear classifier which maximizes the
margin subject to the constraints.
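
The geometric quantities involved are easy to compute. The sketch below, in plain Python with hypothetical helper names, evaluates the distance of a point from a hyperplane and the margin width of a classifier that has been scaled so that its nearest points satisfy $f(x) = \pm 1$. It also shows that scaling both parameters by a common factor leaves the distance unchanged.

```python
import math


def norm(theta):
    """Euclidean norm of a parameter vector given as a list."""
    return math.sqrt(sum(t * t for t in theta))


def distance_to_hyperplane(x, theta, theta0):
    """Distance of the point x from the hyperplane theta^T x + theta0 = 0."""
    value = sum(t * xi for t, xi in zip(theta, x)) + theta0
    return abs(value) / norm(theta)


def margin_width(theta):
    """Width of the margin between the hyperplanes f(x) = +1 and f(x) = -1."""
    return 2.0 / norm(theta)


if __name__ == "__main__":
    theta, theta0 = [3.0, 4.0], -5.0
    print(distance_to_hyperplane([3.0, 4.0], theta, theta0))
    # Scaling theta and theta0 together does not move the hyperplane.
    print(distance_to_hyperplane([3.0, 4.0], [6.0, 8.0], -10.0))
    print(margin_width(theta))
```

Minimizing $\|\theta\|$ under the constraints therefore amounts to making `margin_width(theta)` as large as possible.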

The margin interpretation of the regularizing term $\|\theta\|^2$ ties nicely the task of designing classifiers,
which maximize the margin, with the statistical learning theory and the pioneering work of Vapnik-
Chervonenkis, which establishes elegant performance bounds on the generalization properties of such
classifiers (see, for example, [28,124,128,129]).

The Solution 

Following similar steps as for the support vector regression case, the solution is given as a linear
combination of a subset of the training samples, that is,

$$\hat{\theta} = \sum_{n=1}^{N_s} \lambda_n y_n x_n, \tag{11.67}$$

where $N_s$ is the number of the nonzero Lagrange multipliers. It turns out that only the Lagrange
multipliers associated with the nearest-to-the-classifier points, that is, those points satisfying the constraints
with equality ($y_n(\hat{\theta}^T x_n + \hat{\theta}_0) = 1$), are nonzero. These are known as the support vectors. The Lagrange
multipliers corresponding to the points farther away ($y_n(\hat{\theta}^T x_n + \hat{\theta}_0) > 1$) are zero. The estimate of the
bias term, $\hat{\theta}_0$, is obtained by selecting all constraints with $\lambda_n \ne 0$, corresponding to

$$y_n(\hat{\theta}^T x_n + \hat{\theta}_0) - 1 = 0, \quad n = 1, 2, \ldots, N_s,$$

solving for $\hat{\theta}_0$ and taking the average value.

For the more general RKHS case, the solution is a function given by

$$\hat{\theta}(\cdot) = \sum_{n=1}^{N_s} \lambda_n y_n \kappa(\cdot, x_n),$$

which leads to the following prediction rule. Given an unknown $x$, its class label is predicted according
to the sign of

$$\hat{y}(x) = \langle \hat{\theta}, \kappa(\cdot, x) \rangle + \hat{\theta}_0,$$

or

$$\hat{y}(x) = \sum_{n=1}^{N_s} \lambda_n y_n \kappa(x, x_n) + \hat{\theta}_0: \quad \text{support vector machine prediction}. \tag{11.68}$$

Similarly to the linear case, we select all constraints associated with $\lambda_n \ne 0$, i.e.,

$$y_n \left( \sum_{m=1}^{N_s} \lambda_m y_m \kappa(x_m, x_n) + \hat{\theta}_0 \right) - 1 = 0, \quad n = 1, 2, \ldots, N_s,$$

and $\hat{\theta}_0$ is computed as the average of the values obtained from each one of these constraints. Although
the solution is unique, the corresponding Lagrange multipliers may not be unique (see, for example,
[124]).
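
The prediction rule and the averaging used for the bias term can be sketched as follows. This is an illustration only: it assumes that the nonzero Lagrange multipliers are already available from some solver, uses a linear kernel on scalar inputs for the toy data, and all names (`svm_decision`, `bias_from_support_vectors`, `linear_kernel`) are hypothetical.

```python
def linear_kernel(x, z):
    """Linear kernel for scalar inputs (used only for the toy example)."""
    return x * z


def svm_decision(x, sv_x, sv_y, lam, theta0, kernel):
    """Kernel expansion over the support vectors plus the bias.
    The predicted class label is the sign of the returned value."""
    return sum(l * y * kernel(x, xn)
               for l, y, xn in zip(lam, sv_y, sv_x)) + theta0


def bias_from_support_vectors(sv_x, sv_y, lam, kernel):
    """Average of the bias values implied by each active constraint.
    Since y_n is +1 or -1, y_n * (s + theta0) = 1 gives theta0 = y_n - s."""
    values = []
    for xn, yn in zip(sv_x, sv_y):
        s = sum(l * y * kernel(xm, xn) for l, y, xm in zip(lam, sv_y, sv_x))
        values.append(yn - s)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy separable problem: x = -1 (class -1) and x = +1 (class +1).
    # Its maximum margin solution has multipliers 0.5 and 0.5.
    sv_x, sv_y, lam = [-1.0, 1.0], [-1, 1], [0.5, 0.5]
    theta0 = bias_from_support_vectors(sv_x, sv_y, lam, linear_kernel)
    print(theta0, svm_decision(2.0, sv_x, sv_y, lam, theta0, linear_kernel))
```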

Finally, it must be stressed that the number of support vectors is related to the generalization per- 
formance of the classifier. The smaller the number of support vectors, the better the generalization is 
expected to be [28,124]. 

The Optimization Task 

This part can also be bypassed in a first reading.

The task in (11.64)–(11.65) is a quadratic programming task and can be solved following similar
steps to those adopted for the SVR task.

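The associated Lagrangian is given by

$$L(\theta, \theta_0, \lambda) = \frac{1}{2} \|\theta\|^2 - \sum_{n=1}^{N} \lambda_n \left( y_n (\theta^T x_n + \theta_0) - 1 \right), \tag{11.69}$$

and the KKT conditions (Appendix C) become

$$\frac{\partial}{\partial \theta} L(\theta, \theta_0, \lambda) = 0 \longrightarrow \hat{\theta} = \sum_{n=1}^{N} \lambda_n y_n x_n, \tag{11.70}$$

$$\frac{\partial}{\partial \theta_0} L(\theta, \theta_0, \lambda) = 0 \longrightarrow \sum_{n=1}^{N} \lambda_n y_n = 0, \tag{11.71}$$

$$\lambda_n \left( y_n (\theta^T x_n + \theta_0) - 1 \right) = 0, \quad n = 1, 2, \ldots, N, \tag{11.72}$$

$$\lambda_n \ge 0, \quad n = 1, 2, \ldots, N. \tag{11.73}$$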

The Lagrange multipliers are obtained via the dual representation form after plugging (11.70) into the
Lagrangian (Problem 11.11), that is,

$$\text{maximize with respect to } \lambda \quad \sum_{n=1}^{N} \lambda_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m y_n y_m x_n^T x_m, \tag{11.74}$$

$$\text{subject to} \quad \lambda_n \ge 0, \tag{11.75}$$

$$\sum_{n=1}^{N} \lambda_n y_n = 0. \tag{11.76}$$

For the case where the original task has been mapped to an RKHS, the cost function becomes

$$\sum_{n=1}^{N} \lambda_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m y_n y_m \kappa(x_n, x_m).$$

• According to (11.72), if $\lambda_n \ne 0$, then necessarily

$$y_n(\theta^T x_n + \theta_0) = 1. \tag{11.77}$$

That is, the respective points are the closest points, from each class, to the classifier (distance $\frac{1}{\|\theta\|}$).
They lie on either of the two hyperplanes forming the border of the margin. These points are the
support vectors and the respective constraints are known as the active constraints. The rest of the
points, associated with

$$y_n(\theta^T x_n + \theta_0) > 1, \tag{11.78}$$

which lie outside the margin, correspond to $\lambda_n = 0$ (inactive constraints).

• The cost function in (11.64) is strictly convex and, hence, the solution of the optimization task is
unique (Appendix C).

11.10.2 NONSEPARABLE CLASSES 

We now turn our attention to the more realistic case of overlapping classes and the corresponding
geometric interpretation of the task in (11.61)–(11.63). In this case, there is no (linear) classifier that
can classify correctly all the points, and some errors are bound to occur. Fig. 11.17 shows the respective
geometry for a linear classifier. Note that although the classes are not separable, we still define the
margin as the area between the two hyperplanes $f(x) = \pm 1$. There are three types of points.

• Points that lie on the border or outside the margin and in the correct side of the classifier, that is,

$$y_n f(x_n) \ge 1.$$

These points commit no (margin) error, and hence the inequality in (11.62) is satisfied for

$$\xi_n = 0.$$

• Points which lie on the correct side of the classifier, but lie inside the margin (circled points), that is,

$$0 \le y_n f(x_n) < 1.$$

These points commit a margin error, and the inequality in (11.62) is satisfied for

$$0 < \xi_n \le 1.$$

• Points that lie on the wrong side of the classifier (points in squares), that is,

$$y_n f(x_n) < 0.$$

These points commit an error and the inequality in (11.62) is satisfied for

$$1 < \xi_n.$$

FIGURE 11.17

When classes are overlapping, there are three types of points: (a) points that lie outside or on the borders of the margin
and are classified correctly ($\xi_n = 0$); (b) points inside the margin and classified correctly ($0 < \xi_n \le 1$), denoted
by circles; and (c) misclassified points, denoted by a square ($\xi_n > 1$).

Our desire would be to estimate a hyperplane classifier, so as to maximize the margin and at the
same time keep the number of errors (including margin errors) as small as possible. This goal could be
expressed via the optimization task in (11.61)–(11.62), if in place of $\xi_n$ we had the indicator function,
$I(\xi_n)$, where

$$I(\xi) = \begin{cases} 1, & \text{if } \xi > 0, \\ 0, & \text{if } \xi = 0. \end{cases}$$

However, in such a case the task becomes a combinatorial one. So, we relax the task and use $\xi_n$ in place
of the indicator function, leading to (11.61)–(11.62). Note that optimization is achieved in a tradeoff
rationale; the user-defined parameter $C$ controls the influence of each of the two contributions to the
minimization task. If $C$ is large, the resulting margin (the distance between the two hyperplanes defined
by $f(x) = \pm 1$) will be small, in order to commit a smaller number of margin errors. If $C$ is small, the
opposite is true. As we will see from the simulation examples, the choice of $C$ is critical.
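
The three types of points and the role of $C$ can be illustrated with a short sketch. It assumes $\rho = 1$ and a given linear classifier; the function names and the labels returned by `point_type` are hypothetical, and the boundary case $y_n f(x_n) = 0$ is grouped with the margin errors as an assumption of this example.

```python
def slack(y, f):
    """Smallest slack satisfying y * f >= 1 - xi with xi >= 0."""
    return max(0.0, 1.0 - y * f)


def point_type(y, f):
    """Classify a training point by the value of its slack variable."""
    xi = slack(y, f)
    if xi == 0.0:
        return "no margin error"
    if xi <= 1.0:
        return "margin error, correct side"
    return "misclassified"


def soft_margin_cost(theta, theta0, X, Y, C):
    """Cost of the relaxed task: half squared norm plus C times total slack."""
    total = 0.0
    for x, y in zip(X, Y):
        f = sum(t * xi for t, xi in zip(theta, x)) + theta0
        total += slack(y, f)
    return 0.5 * sum(t * t for t in theta) + C * total
```

For instance, with `theta = [1.0]`, `theta0 = 0.0`, and three points of class $+1$ at 2.0, 0.5, and -0.5, the slacks are 0, 0.5, and 1.5; a larger `C` makes the last two terms dominate the cost.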


The Solution

Once more, the solution is given as a linear combination of a subset of the training points,

$$\hat{\theta} = \sum_{n=1}^{N_s} \lambda_n y_n x_n, \tag{11.79}$$

where $\lambda_n$, $n = 1, 2, \ldots, N_s$, are the nonzero Lagrange multipliers associated with the support vectors.
In this case, support vectors are all points that lie either (a) on the pair of the hyperplanes that define
the margin or (b) inside the margin or (c) outside the margin but on the wrong side of the classifier.
That is, correctly classified points that lie outside the margin do not contribute to the solution, because
the corresponding Lagrange multipliers are zero. For the RKHS case, the class prediction rule is the
same as in (11.68), where $\hat{\theta}_0$ is computed from the constraints corresponding to $\lambda_n \ne 0$ and $\xi_n = 0$;
these correspond to the points that lie on the hyperplanes defining the margin and on the correct side
of the classifier.

The Optimization Task 

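As before, this part can be bypassed in a first reading.

The Lagrangian associated with (11.61)–(11.63) is given by

$$L(\theta, \theta_0, \xi, \lambda) = \frac{1}{2} \|\theta\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} \mu_n \xi_n - \sum_{n=1}^{N} \lambda_n \left( y_n (\theta^T x_n + \theta_0) - 1 + \xi_n \right),$$

leading to the following KKT conditions:

$$\frac{\partial L}{\partial \theta} = 0 \longrightarrow \hat{\theta} = \sum_{n=1}^{N} \lambda_n y_n x_n, \tag{11.80}$$

$$\frac{\partial L}{\partial \theta_0} = 0 \longrightarrow \sum_{n=1}^{N} \lambda_n y_n = 0, \tag{11.81}$$

$$\frac{\partial L}{\partial \xi_n} = 0 \longrightarrow C - \mu_n - \lambda_n = 0, \tag{11.82}$$

$$\lambda_n \left( y_n (\theta^T x_n + \theta_0) - 1 + \xi_n \right) = 0, \quad n = 1, 2, \ldots, N, \tag{11.83}$$

$$\mu_n \xi_n = 0, \quad n = 1, 2, \ldots, N, \tag{11.84}$$

$$\mu_n \ge 0, \quad \lambda_n \ge 0, \quad n = 1, 2, \ldots, N, \tag{11.85}$$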

and in our by-now-familiar procedure, the dual problem is cast as 


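$$\text{maximize with respect to } \lambda \quad \sum_{n=1}^{N}\lambda_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_m y_n y_m x_n^T x_m, \tag{11.86}$$

$$\text{subject to} \quad 0 \le \lambda_n \le C, \quad n = 1, 2, \ldots, N, \tag{11.87}$$

$$\sum_{n=1}^{N}\lambda_n y_n = 0. \tag{11.88}$$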


When working in an RKHS, the cost function becomes

$$\sum_{n=1}^{N}\lambda_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_m y_n y_m \kappa(x_n, x_m).$$


Observe that the only difference compared to its linearly class-separable counterpart in (11.74)–(11.76)
is the existence of $C$ in the inequality constraints for $\lambda_n$. The following comments are in order:

• From (11.83), we conclude that for all the points outside the margin, and on the correct side of the
classifier, which correspond to $\xi_n = 0$, we have

$$y_n(\theta^T x_n + \theta_0) > 1,$$

hence, $\lambda_n = 0$. That is, these points do not participate in the formation of the solution in (11.80).

• We have $\lambda_n \neq 0$ only for the points that live either on the border hyperplanes or inside the margin
or outside the margin but on the wrong side of the classifier. These comprise the support vectors.

• For points lying inside the margin or outside but on the wrong side, $\xi_n > 0$; hence, from (11.84),
$\mu_n = 0$ and from (11.82) we get

$$\lambda_n = C.$$

• Support vectors which lie on the margin border hyperplanes satisfy $\xi_n = 0$ and therefore $\mu_n$ can be
nonzero, which leads to

$$0 \le \lambda_n \le C.$$
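
The dual cost, its constraints, and the comments above can be checked numerically for a candidate vector of multipliers. The following is a minimal sketch in plain Python, not a solver: the function names, the Gaussian kernel choice, and the tolerance are assumptions made for illustration. It evaluates the RKHS form of the cost, tests the constraints (11.87)–(11.88), and groups the multipliers into the three cases $\lambda_n = 0$, $0 < \lambda_n < C$, and $\lambda_n = C$.

```python
import math

def gaussian_kernel(x, y, sigma2=1.0):
    """kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the kernel choice is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma2))

def dual_objective(lam, X, y, kernel=gaussian_kernel):
    """sum_n lam_n - 0.5 * sum_n sum_m lam_n lam_m y_n y_m kappa(x_n, x_m)."""
    N = len(lam)
    quad = sum(lam[n] * lam[m] * y[n] * y[m] * kernel(X[n], X[m])
               for n in range(N) for m in range(N))
    return sum(lam) - 0.5 * quad

def is_feasible(lam, y, C, tol=1e-9):
    """Check the box constraint 0 <= lam_n <= C and the equality sum_n lam_n y_n = 0."""
    in_box = all(-tol <= l <= C + tol for l in lam)
    balanced = abs(sum(l * t for l, t in zip(lam, y))) <= tol
    return in_box and balanced

def categorize(lam, C, tol=1e-9):
    """Group multipliers as in the comments above (labels are hypothetical names)."""
    groups = []
    for l in lam:
        if l <= tol:
            groups.append("zero")      # not a support vector
        elif l >= C - tol:
            groups.append("at_bound")  # lam_n = C: inside the margin or on the wrong side
        else:
            groups.append("free")      # 0 < lam_n < C: on a margin hyperplane
    return groups
```

Points lying exactly on a margin hyperplane may also attain $\lambda_n = 0$ or $\lambda_n = C$, so the grouping is only indicative at the two ends of the interval.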


Remarks 11.5. 

• $\nu$-SVM: An alternative formulation for the SVM classification has been given in [104], where the
margin is defined by the pair of hyperplanes,


$$\theta^T x + \theta_0 = \pm\rho,$$

and $\rho \ge 0$ is left as a free variable, giving rise to the $\nu$-SVM; $\nu$ controls the importance of $\rho$ in
the associated cost function. It has been shown [23] that the $\nu$-SVM and the formulation discussed
above, which is sometimes referred to as the $C$-SVM, lead to the same solution for appropriate
choices of $C$ and $\nu$. However, the advantage of $\nu$-SVM lies in the fact that $\nu$ can be directly related to
bounds concerning the number of support vectors and the corresponding error rate (see also [124]).

• Reduced convex hull interpretation: In [58], it has been shown that for linearly separable classes, the
SVM formulation is equivalent to finding the nearest points between the convex hulls formed by the
data in the two classes. This result was generalized for overlapping classes in [33]; it is shown that
in this case the $\nu$-SVM task is equivalent to searching for the nearest points between the reduced
convex hulls (RCHs) associated with the training data. Searching for the RCH is a computationally
hard task of combinatorial nature. The problem was efficiently solved in [73-75,122], who came up
with efficient iterative schemes to solve the SVM task, via nearest point searching algorithms. More
on these issues can be obtained from [66,123,124].

• $\ell_1$ Regularized versions: The regularization term, which has been used in the optimization tasks
discussed so far, has been based on the $\ell_2$ norm. A lot of research effort has been focused on using
$\ell_1$ norm regularization for tasks treating the linear case. To this end, a number of different loss
functions have been used in addition to the squared error, hinge loss, and $\epsilon$-insensitive versions,
as for example the logistic loss. The solution of such tasks comes under the general framework
discussed in Chapter 8. As a matter of fact, some of these methods have been discussed there.
A related concise review is provided in [132].

• Multitask learning: In multitask learning, two or more related tasks, for example, classifiers, are 
jointly optimized. Such problems are of interest in, for example, econometrics and bioinformatics. 
In [40], it is shown that the problem of estimating many task functions with regularization can be 
cast as a single task learning problem if a family of appropriately defined multitask kernel functions 
is used.


FIGURE 11.18

(A) The training data points for the two classes (red and gray, respectively) of Example 11.4. The full line is the
graph of the obtained SVM classifier and the dotted lines indicate the margin, for $C = 20$. (B) The result for $C = 1$.
For both cases, the Gaussian kernel with $\sigma = 20$ was used.


Example 11.4. In this example, the performance of the SVM is tested in the context of a two-class
two-dimensional classification task. The data set comprises $N = 150$ points uniformly distributed in
the region $[-5, 5] \times [-5, 5]$. For each point $x_n = [x_{n,1}, x_{n,2}]^T$, $n = 1, 2, \ldots, N$, we compute

$$y_n = 0.05x_{n,1}^3 + 0.05x_{n,1}^2 + 0.05x_{n,1} + 0.05 + \eta,$$

where $\eta$ stands for zero mean Gaussian noise of variance $\sigma_\eta^2 = 4$. The point is assigned to either of the
two classes, depending on the value of the noise as well as its position with respect to the graph of the
function


$$f(x) = 0.05x^3 + 0.05x^2 + 0.05x + 0.05$$

in the two-dimensional space. That is, if $x_{n,2} \ge y_n$, the point is assigned to class $\omega_1$; otherwise, it is
assigned to class $\omega_2$.

The Gaussian kernel was used with $\sigma = 10$, as this resulted in the best performance. Fig. 11.18A
shows the obtained classifier for $C = 20$ and Fig. 11.18B that for $C = 1$. Observe how the obtained
classifier, and hence the performance, depends heavily on the choice of $C$. In the former case, the
number of the support vectors was equal to 64, and for the latter it was equal to 84.
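
The data generation step of this example can be sketched as follows. This is only an illustration in plain Python: the function names, the random seed, and the encoding of the classes $\omega_1$ and $\omega_2$ as $+1$ and $-1$ are assumptions, and the sketch does not reproduce the exact data set or the reported numbers of support vectors.

```python
import random

def cubic(x):
    """f(x) = 0.05 x^3 + 0.05 x^2 + 0.05 x + 0.05, the noise-free curve of the example."""
    return 0.05 * x ** 3 + 0.05 * x ** 2 + 0.05 * x + 0.05

def generate_data(N=150, noise_var=4.0, seed=0):
    """Draw N points uniformly in [-5, 5] x [-5, 5] and label them against the noisy curve."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    points, labels = [], []
    for _ in range(N):
        x1 = rng.uniform(-5.0, 5.0)
        x2 = rng.uniform(-5.0, 5.0)
        y_n = cubic(x1) + rng.gauss(0.0, noise_var ** 0.5)  # add zero mean Gaussian noise
        points.append((x1, x2))
        labels.append(+1 if x2 >= y_n else -1)  # +1 stands for omega_1, -1 for omega_2
    return points, labels
```

The resulting lists can be passed to any SVM solver with a Gaussian kernel in order to repeat the experiment for different values of $C$.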

11.10.3 PERFORMANCE OF SVMs AND APPLICATIONS 

A notable characteristic of the SVMs is that the complexity is independent of the dimensionality of the 
respective RKHS. This has an influence on the generalization performance; SVMs exhibit very good 
generalization performance in practice. Theoretically, such a claim is substantiated by their maximum 
margin interpretation, in the framework of the elegant structural risk minimization theory [28, 124, 128]. 

An extensive comparative study concerning the performance of SVMs against 16 other popular
classifiers has been reported in [77]. The results verify that the SVM ranks at the very top among
these classifiers, although there are cases for which other methods score better performance. Another
comparative performance study is reported in [25].

It is hard to find a discipline related to machine learning/pattern recognition where SVMs and the
concept of working in kernel spaces have not been applied. Early applications included data mining,
spam categorization, object recognition, medical diagnosis, optical character recognition (OCR), and
bioinformatics (see, for example, [28] for a review). More recent applications include cognitive radio
(for example, [35]), spectrum cartography and network flow prediction [9], and image denoising [13].

In [117], the notion of kernel embedding of conditional probabilities is reviewed, as a means to ad- 
dress challenging problems in graphical models. The notion of kernelization has also been extended in 
the context of tensor-based models [50,107,133]. Kernel-based hypothesis testing is reviewed in [49]. 
In [121], the use of kernels in manifold learning is discussed in the framework of diffusion maps. The 
task of analyzing the performance of kernel techniques with regard to dimensionality, signal-to-noise 
ratio, and local error bars is reviewed in [81]. In [120], a collection of articles related to kernel-based 
methods and applications is provided. 

11.10.4 CHOICE OF HYPERPARAMETERS

One of the main issues associated with SVM/SVRs is the choice of the parameter $C$, which controls
the relative influence of the loss and the regularizing term in the cost function. Although some
efforts have been made in developing theoretical tools for the respective optimization, the path that
has survived in practice is that of cross-validation techniques against a test data set. Different values
of $C$ are used to train the model, and the value that results in the best performance over the test set is
selected.

The other main issue is the choice of the kernel function. Different kernels lead to different
performance. Let us look carefully at the expansion in (11.17). One can think of $\kappa(x, x_n)$ as a function that
measures the similarity between $x$ and $x_n$; in other words, $\kappa(x, x_n)$ matches $x$ to the training sample
$x_n$. A kernel is local if $\kappa(x, x_n)$ takes relatively large values in a small region around $x_n$. For example,
when the Gaussian kernel is used, the contribution of $\kappa(x, x_n)$ away from $x_n$ decays exponentially
fast, depending on the value of $\sigma^2$. Thus, the choice of $\sigma^2$ is very crucial. If the function to be
approximated is smooth, then large values of $\sigma^2$ should be employed. On the contrary, if the function is highly
varying in input space, the Gaussian kernel may not be the best choice. As a matter of fact, if for such
cases the Gaussian kernel is employed, one must have access to a large number of training data, in
order to fill in the input space densely enough, so as to be able to obtain a good enough approximation
of such a function. This brings into the scene another critical factor in machine learning, related to the
size of the training set. The latter is not only dictated by the dimensionality of the input space (curse of
dimensionality) but it also depends on the type of variation that the unknown function undergoes (see,
for example, [12]). In practice, in order to choose the right kernel function, one uses different kernels
and after cross-validation selects the “best” one for the specific problem.
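
The locality of the Gaussian kernel can be made concrete with a few numbers. The sketch below is plain Python and assumes the common parameterization $\kappa(x, x_n) = \exp(-\|x - x_n\|^2 / (2\sigma^2))$; the function names are hypothetical. It evaluates the kernel at increasing distances from a training point placed at the origin.

```python
import math

def gaussian_kernel(x, x_n, sigma2):
    """Gaussian kernel with variance parameter sigma2 (assumed parameterization)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, x_n))
    return math.exp(-d2 / (2.0 * sigma2))

def kernel_profile(distances, sigma2):
    """Kernel values at the given distances from a one-dimensional training point at 0."""
    return [gaussian_kernel((d,), (0.0,), sigma2) for d in distances]

if __name__ == "__main__":
    for sigma2 in (0.5, 8.0):
        # A small sigma2 gives a sharply local kernel; a large one decays slowly.
        print(sigma2, [round(v, 4) for v in kernel_profile([0.0, 1.0, 2.0, 4.0], sigma2)])
```

For a small $\sigma^2$ the values drop to nearly zero within a short distance, which is why a highly varying function requires densely placed training points, whereas a large $\sigma^2$ lets each training point influence a wide region.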

A line of research is to design kernels that match the data at hand, based either on some prior 
knowledge or via some optimization path (see, for example, [29,65]). In Section 11.13 we will discuss 
techniques which use multiple kernels in an effort to optimally combine their individual characteristics. 

11.10.5 MULTICLASS GENERALIZATIONS 

The SVM classification task has been introduced in the context of a two-class classification task. The 
more general M-class case can be treated in various ways: 

• One-against-all: One solves $M$ two-class problems. Each time, one of the classes is classified
against all the others using a different SVM. Thus, $M$ classifiers are estimated, that is,

$$f_m(x) = 0, \quad m = 1, 2, \ldots, M,$$

which are trained so that $f_m(x) > 0$ for $x \in \omega_m$ and $f_m(x) < 0$ otherwise. Classification is
achieved via the rule

$$\text{assign } x \text{ in } \omega_k: \quad \text{if } k = \arg\max_{m} f_m(x).$$

According to this method, there may be regions in space where more than one of the discriminant
functions score a positive value [124]. Moreover, another disadvantage of this approach is the
so-called class imbalance problem; this is caused by the fact that the number of training points in one
of the classes (which comprises the data from $M - 1$ classes) can be much larger than the points in
the other. Issues related to the class imbalance problem are discussed in [124].

• One-against-one: According to this method, one solves $M(M-1)/2$ binary classification tasks by
considering all classes in pairs. The final decision is taken on the basis of the majority rule.

• In [129], the SVM rationale is extended in estimating simultaneously M hyperplanes. However, 
this technique ends up with a large number of parameters, equal to N(M — 1), which have to be 
estimated via a single minimization task; this turns out to be rather prohibitive for most practical 
problems. 

• In [34], the multiclass task is treated in the context of error correcting codes. Each class is associated 
with a binary code word. If the code words are properly chosen, an error resilience is “embedded” 
into the process (see also [124]). For a comparative study of multiclass classification schemes, see, 
for example, [41] and the references therein. 

• Division and Clifford algebras: The SVM framework has also been extended to treat complex and
hypercomplex data, both for the regression and the classification cases, using either division
algebras or Clifford algebras [8]. In [125], the case of quaternion RKH spaces is considered.
A more general method for the case of complex-valued data, which exploits the notion of widely
linear estimation as well as pure complex kernels, has been presented in [14]. In this paper, it is
shown that any complex SVM/SVR task is equivalent to solving two real SVM/SVR tasks
exploiting a specific real kernel, which is generated by the chosen complex one. Moreover, in the
classification case, it is shown that the proposed framework inherently splits the complex space into
four parts. This leads naturally to solving the four-class task (quaternary classification), instead of
the standard two-class scenario of the real SVM. This rationale can be used in a multiclass problem
as a split-class scenario.
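
The decision rules of the first two schemes in the list above can be sketched in a few lines. The code below is plain Python and assumes that the binary discriminant functions are already trained; the function names and the convention that a positive pairwise output votes for the first class of the pair are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def one_against_all(scores):
    """scores[m] holds f_m(x); return the class index k = argmax_m f_m(x)."""
    return max(range(len(scores)), key=lambda m: scores[m])

def one_against_one(x, pairwise, M):
    """Majority rule over the M(M-1)/2 pairwise classifiers.

    pairwise[(i, j)] with i < j is a callable; a positive output votes for
    class i and any other output votes for class j (an assumed convention).
    """
    votes = Counter()
    for i, j in combinations(range(M), 2):
        votes[i if pairwise[(i, j)](x) > 0 else j] += 1
    # Ties are broken by the order in which classes first received a vote.
    return votes.most_common(1)[0][0]
```

For example, with three classes separated along a line by thresholds, `one_against_one` needs the three classifiers indexed by `(0, 1)`, `(0, 2)`, and `(1, 2)`.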


11.11 COMPUTATIONAL CONSIDERATIONS 

Solving a quadratic programming task, in general, requires $O(N^3)$ operations and $O(N^2)$ memory
operations. To cope with such demands a number of decomposition techniques have been devised (for
example, [22,55]), which “break” the task into a sequence of smaller ones. In [59,88,89], the
sequential minimal optimization (SMO) algorithm breaks the task into a sequence of problems comprising
two points, which can be solved analytically. Efficient implementation of such schemes leads to an
empirical training time that scales between $O(N)$ and $O(N^{2.3})$.

The schemes derived in [74,75] treat the task as a minimum distance points search between reduced 
convex hulls and end up with an iterative scheme which projects the training points on hyperplanes. 
The scheme leads to even more efficient implementations compared to [59,88]; moreover, the minimum 
distance search algorithm has a built-in enhanced parallelism. 

The issue of parallel implementation is also discussed in [21]. Issues concerning complexity and
accuracy are reported in [53]. In the latter, polynomial time algorithms are derived that produce
approximate solutions with a guaranteed accuracy for a class of QP problems including SVM classifiers.

Incremental versions for solving the SVM task which deal with sequentially arriving data have 
also appeared (for example, [26,37,100]). In the latter, at each iteration a new point is considered and 
the previously selected set of support vectors (active set) is updated accordingly by adding/removing 
samples. 

Online versions that apply in the primal problem formulation have also been proposed. In [84], an
iterative reweighted LS approach is followed that alternates weight optimization with cost constraint
forcing. A structurally and computationally simple scheme, named Primal Estimated sub-Gradient
SOlver for SVM (PEGASOS), has been proposed in [105]. The algorithm is of an iterative subgradient
form applied on the regularized empirical hinge loss function in (11.60) (see also Chapter 8). The
algorithm, for the case of linear kernels, exhibits very good convergence properties and it finds an
$\epsilon$-accurate solution in $O(1/\epsilon)$ iterations.
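
A stochastic subgradient iteration of the PEGASOS type for the linear case can be sketched as follows. This is a simplified illustration in plain Python, assuming the cost $\frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_n \max(0, 1 - y_n w^T x_n)$, a step size $1/(\lambda t)$, one randomly chosen sample per iteration, no bias term, and no projection step; the function name and the default parameter values are hypothetical.

```python
import random

def pegasos(X, y, lam=0.1, T=1000, seed=0):
    """Stochastic subgradient descent on the regularized hinge loss (linear kernel)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, T + 1):
        i = rng.randrange(len(X))          # pick one training sample at random
        eta = 1.0 / (lam * t)              # decaying step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        w = [(1.0 - eta * lam) * wj for wj in w]   # shrinkage from the regularizer
        if margin < 1.0:                   # hinge loss is active: add its subgradient step
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w
```

A bias term, mini-batches, or the optional projection onto a ball can be added without changing the overall structure of the loop.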

In [43,56] the classical technique of cutting planes for solving convex tasks has been employed in 
the context of SVMs in the primal domain. The resulting algorithm is very efficient and, in particular 
for the linear SVM case, the complexity becomes of order O(N). 

In [126] the concept of the core vector machine (CVM) is introduced to lead to efficient solutions.
The main concept behind this method is that of the minimum enclosing ball (MEB) in computational
geometry. Then a subset of points is used, known as the core set, to achieve an approximation to the
MEB, which is employed during the optimization.

In [27], a comparative study between solving the SVM tasks in the primal and dual domains is
carried out. The findings of the paper point out that both paths are equally efficient, for the linear as
well as for the nonlinear cases. Moreover, when the goal is to resort to approximate solutions, opting for
the primal task can offer certain benefits. In addition, working in the primal also offers the advantage of
tuning the hyperparameters by resorting to joint optimization. One of the main advantages of resorting
to the dual domain was the luxury of casting the task in terms of inner products. However, this is
also possible in the primal, by appropriate exploitation of the representer theorem. We will see such
examples in Section 11.12.1.

More recently, a version of SVM for distributed processing has been presented in [42]. In [99], the 
SVM task is solved in a subspace using the method of random projections. 


11.12 RANDOM FOURIER FEATURES 

In Section 11.11, a number of techniques were discussed in order to address and cope with the
computational load and the way this scales with the number of training data points, $N$, in the context of
SVMs. Furthermore, at the heart of a number of kernel-based methods, such as the kernel ridge
regression (Section 11.7) and the Gaussian processes (to be discussed in Section 13.12) lies the inversion
of the kernel matrix, which has been defined in Section 11.5.1. The latter is of dimension $N \times N$ and
it is formed by all possible kernel evaluation pairs in the set of the training examples, i.e., $\kappa(x_i, x_j)$,
$i, j = 1, 2, \ldots, N$. Inverting an $N \times N$ matrix amounts, in general, to $O(N^3)$ algebraic operations,
which gets out of hand as $N$ increases. In this vein, a number of techniques have been suggested in
an effort to reduce related computational costs. Some examples of such methods are given in [1,45],
where random projections are employed to “throw away” individual elements or entire rows of the
matrix in an effort to form related low rank or sparse matrix approximations.

An alternative “look” at the computational load reduction task was presented in [91]. The essence
of this approach is to bypass the implicit mapping to a higher (possibly infinite)-dimensional space
via the kernel function, i.e., $\phi(x) = \kappa(\cdot, x)$, which comprises the spine of any kernel-based method
(Section 11.5). Instead, an explicit mapping to a finite-dimensional space, $\mathbb{R}^D$, is performed, i.e.,

$$x \in \mathcal{X} \subseteq \mathbb{R}^l \longmapsto z(x) \in \mathbb{R}^D.$$

However, the mapping is done in a “thoughtful” way so that the inner product between two vectors in
this $D$-dimensional space is approximately equal to the respective value of the kernel function, i.e.,

$$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle \approx z(x)^T z(y), \quad \forall x, y \in \mathcal{X}.$$

To establish such an approximation, Bochner’s theorem from harmonic analysis was mobilized (e.g.,
[96]). To grasp the main rationale behind this theorem, without resorting to mathematical details, let us
recall the notion of power spectral density (PSD) from Section 2.4.3. The PSD of a stationary process
is defined as the Fourier transform of its autocorrelation sequence and there it was shown to be a real
and nonnegative function. Furthermore, its integral is finite (Eq. (2.125)). Although the proof given
there followed a rather “practical” and conceptual path, a more mathematically rigorous proof reveals
that this result is a direct by-product of the positive definite property of the autocorrelation sequence
(Eq. (2.116)).

In this line of thought, everything that was said above for the PSD is also valid for the Fourier
transform of a shift-invariant kernel function. As is well known, a kernel function is a positive definite
one (Section 11.5.1). Also, the shift-invariant property, i.e., $\kappa(x, y) = \kappa(x - y)$, could be seen as the
equivalent of the stationarity. Then one can show that the Fourier transform of a shift-invariant kernel
is a nonnegative real function and that its integral is finite. After some normalization, we can always
make the integral equal to one. Thus, the Fourier transform of a shift-invariant kernel function can be
treated and interpreted as a PDF, that is, a real nonnegative function that integrates to one.

Let $\kappa(x - y)$ be a shift-invariant kernel and let $p(\omega)$ be its (multidimensional) Fourier transform,
i.e.,

$$p(\omega) = \frac{1}{(2\pi)^l}\int_{\mathbb{R}^l} \kappa(r) e^{-j\omega^T r}\, dr.$$

Then, via the inverse Fourier transform, we have


$$\kappa(x - y) = \int_{\mathbb{R}^l} p(\omega) e^{j\omega^T(x - y)}\, d\omega. \tag{11.89}$$

Having interpreted $p(\omega)$ as a PDF, the right-hand side in Eq. (11.89) is the respective expectation, i.e.,

$$\kappa(x - y) = \mathbb{E}_{\omega}\left[e^{j\omega^T x} e^{-j\omega^T y}\right].$$

One can get rid of the complex exponentials by mobilizing Euler’s formula$^{13}$ and the fact that the
kernel is a real function. To this end, we define

$$z_{\omega,b}(x) := \sqrt{2}\cos(\omega^T x + b), \tag{11.90}$$

where $\omega$ is a random vector that follows $p(\omega)$ and $b$ is a uniformly distributed random variable in
$[0, 2\pi]$. Then it is easy to show that (Problem 11.12)

$$\kappa(x - y) = \mathbb{E}_{\omega,b}\left[z_{\omega,b}(x) z_{\omega,b}(y)\right], \tag{11.91}$$

where the expectation is taken with respect to $p(\omega)$ and $p(b)$. The previous expectation can be
approximated by

$$\kappa(x - y) \approx \frac{1}{D}\sum_{i=1}^{D} z_{\omega_i,b_i}(x) z_{\omega_i,b_i}(y),$$

where $(\omega_i, b_i)$, $i = 1, 2, \ldots, D$, are i.i.d. generated samples from the respective distributions.

We now have all the ingredients to define an appropriate mapping that builds upon the previous
findings.

• Step 1: Generate $D$ i.i.d. samples $\omega_i \sim p(\omega)$ and $b_i \sim \mathcal{U}(0, 2\pi)$, $i = 1, 2, \ldots, D$.

• Step 2: Perform the following mapping to $\mathbb{R}^D$:

$$x \in \mathcal{X} \longmapsto z_\Omega(x) = \sqrt{\frac{2}{D}}\left[\cos(\omega_1^T x + b_1), \ldots, \cos(\omega_D^T x + b_D)\right]^T.$$

The above mapping defines an inner product approximation of the original kernel, i.e.,

$$\kappa(x - y) \approx z_\Omega(x)^T z_\Omega(y),$$

where, by definition,

$$\Omega := \begin{bmatrix} \omega_1 & \omega_2 & \cdots & \omega_D \\ b_1 & b_2 & \cdots & b_D \end{bmatrix}.$$

13 Euler’s formula: $e^{j\phi} = \cos\phi + j\sin\phi$.


Once this mapping has been performed, one can mobilize any linear modeling method to work in a 
finite-dimensional space. For example, instead of solving the kernel ridge regression problem, the 
ridge regression task in Section 3.8 can be used. However, now, the involved matrix to be inverted is 
of dimensions D x D instead of N x N, which can lead to significant gains for large enough values 
of N. 
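
The two steps of the mapping can be sketched for the Gaussian kernel. The code below is plain Python and rests on stated assumptions: the kernel is taken as $\exp(-\|x - y\|^2/(2\sigma^2))$, whose normalized Fourier transform is a zero mean Gaussian PDF with covariance $\sigma^{-2} I$, and the function names and seed are hypothetical.

```python
import math
import random

def gaussian_kernel(x, y, sigma2=1.0):
    """Exact kernel value, used only to judge the approximation."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma2))

def sample_rff_params(dim, D, sigma2=1.0, seed=0):
    """Step 1: omega_i ~ N(0, I / sigma2) and b_i ~ U(0, 2*pi), i = 1, ..., D."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(sigma2)
    omegas = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(D)]
    biases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return omegas, biases

def rff_map(x, omegas, biases):
    """Step 2: z(x) = sqrt(2/D) [cos(omega_1^T x + b_1), ..., cos(omega_D^T x + b_D)]."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w * xi for w, xi in zip(omega, x)) + b)
            for omega, b in zip(omegas, biases)]

if __name__ == "__main__":
    omegas, biases = sample_rff_params(dim=2, D=2000)
    x, y = (0.3, -0.5), (1.0, 0.2)
    zx, zy = rff_map(x, omegas, biases), rff_map(y, omegas, biases)
    print(sum(a * b for a, b in zip(zx, zy)), gaussian_kernel(x, y))
```

Once every training point has been mapped, the $D$-dimensional vectors can be fed to a linear method such as ridge regression, so that the matrix to be inverted is $D \times D$.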

The choice of $D$ is user-defined and problem dependent. In [91], related accuracy bounds have
been derived. For within $\epsilon$ accuracy approximation to a shift-invariant kernel, one needs only
$D = O\left(d\epsilon^{-2}\ln\frac{1}{\epsilon^2}\right)$ dimensions. In practice, values of $D$ in the range of a few tens to a few thousands,
depending on the data set, seem to suffice to obtain significant computational gains compared to
some of the more classical methods.


Remarks 11.6. 

• Nyström approximation: Another popular and well-known technique to obtain a low rank
approximation to the kernel matrix is known as the Nyström approximation.

Let $\mathcal{K}$ be an $N \times N$ kernel matrix. Then select randomly $q < N$ points out of $N$. Various sampling
scenarios can be used. It turns out that there exists an approximation matrix $\tilde{\mathcal{K}}$ of rank $q$, such that

$$\tilde{\mathcal{K}} = \mathcal{K}_{nq}\mathcal{K}_{qq}^{-1}\mathcal{K}_{nq}^T,$$

where $\mathcal{K}_{qq}$ is the respective invertible kernel matrix associated with the subset of the $q$ points and
$\mathcal{K}_{nq}$ comprises the columns of $\mathcal{K}$ associated with the previous $q$ points [131]. Generalizations and
related theoretical results can be found in [36]. If the approximate matrix, $\tilde{\mathcal{K}}$, is used in place of the
original kernel matrix, $\mathcal{K}$, the memory requirements are reduced to $O(Nq)$ from $O(N^2)$, and operations
of the order of $O(N^3)$ can be reduced to $O(Nq^2)$.
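
The Nyström formula can be illustrated on a small data set. The sketch below is plain Python and only meant to show the structure of the computation: the Gaussian kernel, the function names, and the explicit Gauss-Jordan solver are assumptions, the indices of the $q$ selected points are passed in by the caller instead of being sampled, and the full $N \times N$ product is formed for readability even though a practical scheme would keep the factors.

```python
import math

def gaussian_kernel(x, y, sigma2=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma2))

def solve(A, B):
    """Solve A X = B (A is q x q, B is q x N) by Gauss-Jordan with partial pivoting."""
    q = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(q)]
    for c in range(q):
        p = max(range(c, q), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(q):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[q:] for row in M]

def nystrom(points, idx, sigma2=1.0):
    """Return K_nq K_qq^{-1} K_nq^T for the columns selected by idx."""
    K_nq = [[gaussian_kernel(x, points[j], sigma2) for j in idx] for x in points]
    K_qq = [[gaussian_kernel(points[i], points[j], sigma2) for j in idx] for i in idx]
    K_nq_T = [list(col) for col in zip(*K_nq)]
    X = solve(K_qq, K_nq_T)  # X = K_qq^{-1} K_nq^T, of size q x N
    N, q = len(points), len(idx)
    return [[sum(K_nq[i][k] * X[k][j] for k in range(q)) for j in range(N)]
            for i in range(N)]
```

By construction, the approximation reproduces the original kernel matrix exactly on the selected columns; the quality on the remaining entries depends on how well the $q$ points represent the data.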


11.12.1 ONLINE AND DISTRIBUTED LEARNING IN RKHS 

One of the major drawbacks that one encounters when dealing with online algorithms in an RKHS is
the growing memory problem. This is a direct consequence of the representer theorem (Section 11.6),
where the number of terms (kernels) in the expansion of the estimated function grows with $N$. Thus, in
most of the online versions that have been proposed, the emphasis is on devising ad hoc techniques to
reduce the number of terms in the expansion. Usually, these techniques build around the concept of a
dictionary, where a subset of the points is judiciously selected, according to a criterion, and subsequently
used in the related expansion. In this vein, various kernel versions of the LMS ([68,94]), RLS ([38,127]),
APSM and APA ([108-113]), and PEGASOS (Section 8.10.3, [62,105]) algorithms have been
suggested. A more extended treatment of this type of algorithms can be downloaded from the book’s
website, under the Additional Material part that is associated with the current chapter.

In contrast, one can bypass the growing memory problem if the RFF rationale is employed. The
RFF framework offers a theoretically pleasing approach, without having to resort to techniques to limit
the growing memory, which, in one way or another, have an ad hoc flavor. In the RFF approach, all one
has to do is to generate randomly the $D$ points for $\omega$ and $b$ and then use the standard LMS, RLS, etc.,
algorithms. The power of the RFF framework becomes more evident when the online learning takes
place in a distributed environment. In such a setting, all the dictionary-based methods that have been
developed so far break down; this is because the dictionaries have to be exchanged among the nodes
and, so far, no practically feasible method has been proposed.

The use of the RFF method in the context of online learning and in particular in a distributed
environment has been presented in [16,17], where, also, related theoretical convergence issues are
discussed. Simulation comparative results reported there reveal the power and performance gains that
one can achieve. Values of $D$ in the range of a few hundred seem to suffice and the performance is
reported to be rather insensitive to the choice of its value, provided that this is not too small.


11.13 MULTIPLE KERNEL LEARNING 

A major issue in all kernel-based algorithmic procedures is the selection of a suitable kernel as well 
as the computation of its defining parameters. Usually, this is carried out via cross-validation, where a 
number of different kernels are used on a validation set separate from the training data (see Chapter 3 
for different methods concerning validation), and the one with the best performance is selected. It 
is obvious that this is not a universal approach; it is time consuming and definitely not theoretically 
appealing. The ideal would be to have a set of different kernels (this also includes the case of the same
kernel with different parameters) and let the optimization procedure decide how to choose the proper 
kernel, or the proper combination of kernels. This is the scope of a research activity, which is usually 
called multiple kernel learning (MKL). To this end, a variety of MKL methods have been proposed to 
treat several kernel-based algorithmic schemes. A complete survey of the field is outside the scope of 
this book. Here, we will provide a brief overview of some of the major directions in MKL methods that 
relate to the content of this chapter. The interested reader is referred to [48] for a comparative study of 
various techniques. 

One of the first attempts to develop an efficient MKL scheme is the one presented in [65], where
the authors considered a linear combination of kernel matrices, that is, $K = \sum_{m=1}^{M} a_m K_m$. Because we
require the new kernel matrix to be positive definite, it is reasonable to impose in the optimization task
some additional constraints. For example, one may adopt the general constraint $K \succeq 0$ (the inequality
indicating semidefinite positiveness), or a more strict one, for example, $a_m \ge 0$, for all $m = 1, \ldots, M$.
Furthermore, one needs to bound the norm of the final kernel matrix. Hence, the general MKL SVM
task can be cast as


$$\begin{aligned} \text{minimize with respect to } K \quad & \omega_C(K), \\ \text{subject to} \quad & K \succeq 0, \\ & \operatorname{trace}\{K\} \le c, \end{aligned} \tag{11.92}$$

where $\omega_C(K)$ is the solution of the dual SVM task, given in (11.75)–(11.77), which can be written in
a more compact form as

$$\omega_C(K) = \max_{\lambda}\left\{\lambda^T \mathbf{1} - \frac{1}{2}\lambda^T G(K)\lambda \; : \; 0 \le \lambda_i \le C, \; \lambda^T y = 0\right\},$$

with each element of $G(K)$ given as $[G(K)]_{i,j} = [K]_{i,j}\, y_i y_j$; $\lambda$ denotes the vector of the Lagrange
multipliers and $\mathbf{1}$ is the vector having all its elements equal to one. In [65], it is shown how (11.92) can
be transformed to a semidefinite programming optimization (SDP) task and solved accordingly.
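
The ingredients of this formulation, namely the combined kernel matrix, the matrix $G(K)$, and the value of the bracketed dual cost for a given vector of multipliers, can be sketched in plain Python. The function names are hypothetical, the weights are restricted to nonnegative values as an assumption, and the sketch only evaluates the cost for a supplied $\lambda$; it does not carry out the maximization or the SDP step.

```python
def combine_kernels(kernel_mats, a):
    """K = sum_m a_m K_m for nonnegative weights a_m."""
    if any(am < 0 for am in a):
        raise ValueError("weights a_m must be nonnegative")
    N = len(kernel_mats[0])
    return [[sum(am * Km[i][j] for am, Km in zip(a, kernel_mats)) for j in range(N)]
            for i in range(N)]

def g_matrix(K, y):
    """[G(K)]_{i,j} = [K]_{i,j} y_i y_j."""
    N = len(y)
    return [[K[i][j] * y[i] * y[j] for j in range(N)] for i in range(N)]

def dual_value(lam, K, y):
    """lam^T 1 - 0.5 lam^T G(K) lam for a given lam (not the maximum over lam)."""
    G = g_matrix(K, y)
    N = len(lam)
    quad = sum(lam[i] * G[i][j] * lam[j] for i in range(N) for j in range(N))
    return sum(lam) - 0.5 * quad
```

A nonnegative combination of positive semidefinite matrices is itself positive semidefinite, which is why the stricter constraint on the weights is sufficient for the combined matrix to be a valid kernel matrix.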

Another path that has been exploited by many authors is to assume that the modeling nonlinear
function is given as a summation,

$$f(x) = \sum_{m=1}^{M} a_m f_m(x) + b = \sum_{m=1}^{M} a_m \langle f_m, \kappa_m(\cdot, x)\rangle_{\mathbb{H}_m} + b = \sum_{m=1}^{M}\sum_{n=1}^{N}\theta_{m,n}\, a_m \kappa_m(x, x_n) + b,$$

where each one of the functions, $f_m$, $m = 1, 2, \ldots, M$, lives in a different RKHS, $\mathbb{H}_m$. The
respective composite kernel matrix, associated with a set of training data, is given by $K = \sum_{m=1}^{M} a_m K_m$,
where $K_1, \ldots, K_M$ are the kernel matrices of the individual RKHSs. Hence, assuming a data set
$\{(y_n, x_n), n = 1, \ldots, N\}$, the MKL learning task can be formulated as follows:

$$\min_{f} \sum_{n=1}^{N}\mathcal{L}(y_n, f(x_n)) + \lambda\Omega(f), \tag{11.93}$$

where $\mathcal{L}$ represents a loss function and $\Omega(f)$ the regularization term. There have been two major trends
following this rationale. The first one gives priority toward a sparse solution, while the second aims at
improving performance.

In the context of the first trend, the solution is constrained to be sparse, so that the kernel matrix is
computed fast and the strong similarities between the data set are highlighted. Moreover, this rationale
can be applied to the case where the type of kernel has been selected beforehand and the goal is to
compute the optimal kernel parameters. One way (e.g., [4,116]) is to employ a regularization term
of the form $\Omega(f) = \left(\sum_{m=1}^{M} a_m \|f_m\|_{\mathbb{H}_m}\right)^2$, which has been shown to promote sparsity among the set
$\{a_1, \ldots, a_M\}$, as it is associated to the group LASSO, when the squared error loss is employed in place
of $\mathcal{L}$ (see Chapter 10).

In contrast to the sparsity promoting criteria, another trend (e.g., [30,63,92]) revolves around the
argument that, in some cases, the sparse MKL variants may not exhibit improved performance compared
to the original learning task. Moreover, some data sets contain multiple similarities between individual
data pairs that cannot be highlighted by a single type of kernel, but require a number of different
kernels to improve learning. In this context, a regularization term of the form $\Omega(f) = \sum_{m=1}^{M} a_m \|f_m\|^2_{\mathbb{H}_m}$
is preferred. There are several variants of these methods that either employ additional constraints
to the task (e.g., $\sum_{m=1}^{M} a_m = 1$) or define the summation of the spaces a little differently (e.g.,
with each $f_m(\cdot)$ rescaled in the sum by a factor that depends on $a_m$). For example, in [92] the authors
reformulate (11.93) as follows:

$$\begin{aligned} \text{minimize with respect to } a \quad & J(a), \\ \text{subject to} \quad & \sum_{m=1}^{M} a_m = 1, \end{aligned} \tag{11.94}$$


where 



The optimization is cast in RKHSs; however, the problem is always formulated in such a way so that 
the kernel trick can be mobilized. 


11.14 NONPARAMETRIC SPARSITY-AWARE LEARNING: ADDITIVE MODELS 


We have already pointed out that the representer theorem, as summarized by (11.17), provides an
approximation of a function, residing in an RKHS, in terms of the respective kernel centered at the
points $x_1, \ldots, x_N$. However, we know that the accuracy of any interpolation/approximation method
depends on the number of points, $N$. Moreover, as discussed in Chapter 3, how large or small $N$
needs to be depends heavily on the dimensionality of the space, exhibiting an exponential dependence
on it (curse of dimensionality); basically, one has to fill in the input space with “enough” data in
order to be able to “learn” with good enough accuracy the associated function. In Chapter 7, the
naive Bayes classifier was discussed; the essence behind this method is to consider each dimension
of the input random vectors, $x \in \mathbb{R}^l$, individually. Such a path breaks the problem into a number $l$ of
one-dimensional tasks. The same idea runs across the so-called additive models approach.

According to the additive models rationale, the unknown function is constrained within the family
of separable functions, that is,

$$f(\boldsymbol{x}) = \sum_{i=1}^{l} h_i(x_i): \quad \text{additive model}, \tag{11.95}$$

where $\boldsymbol{x} = [x_1, \ldots, x_l]^T$. Recall that a special case of such expansions is linear
regression, where $f(\boldsymbol{x}) = \boldsymbol{\theta}^T \boldsymbol{x}$.


We will further assume that each one of the functions, $h_i(\cdot)$, belongs to an RKHS, $H_i$, defined by a
respective kernel, $\kappa_i(\cdot, \cdot)$,

$$\kappa_i(\cdot, \cdot): \mathbb{R} \times \mathbb{R} \longmapsto \mathbb{R}.$$


14 Recall from the discussion in Section 11.10.4 that the other factor that ties accuracy and $N$ together is the rate of variation of
the function.

Let the corresponding norm be denoted as $\|\cdot\|_i$. For the regularized total squared error cost [93], the
optimization task is now cast as

$$\text{minimize with respect to } f: \quad \frac{1}{2}\sum_{n=1}^{N}\big(y_n - f(\boldsymbol{x}_n)\big)^2 + \lambda\sum_{i=1}^{l}\|h_i\|_i, \tag{11.96}$$

$$\text{s.t.} \quad f(\boldsymbol{x}) = \sum_{i=1}^{l} h_i(x_i). \tag{11.97}$$


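If one plugs (11.97) into (11.96) and follows arguments similar to those used in Section 11.6, it is
readily obtained that we can write

$$h_i(\cdot) = \sum_{n=1}^{N} \theta_{i,n}\, \kappa_i(\cdot, x_{i,n}), \tag{11.98}$$

where $x_{i,n}$ is the $i$th component of $\boldsymbol{x}_n$. Moving along the same path as that adopted in
Section 11.7, the optimization can be rewritten in terms of

$$\boldsymbol{\theta}_i = [\theta_{i,1}, \ldots, \theta_{i,N}]^T, \quad i = 1, 2, \ldots, l,$$

as

$$\{\hat{\boldsymbol{\theta}}_i\}_{i=1}^{l} = \arg\min_{\{\boldsymbol{\theta}_i\}_{i=1}^{l}} J(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_l),$$

where

$$J(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_l) := \frac{1}{2}\Big\|\boldsymbol{y} - \sum_{i=1}^{l} \mathcal{K}_i \boldsymbol{\theta}_i\Big\|^2 + \lambda \sum_{i=1}^{l} \sqrt{\boldsymbol{\theta}_i^T \mathcal{K}_i \boldsymbol{\theta}_i}, \tag{11.99}$$

and $\mathcal{K}_i$, $i = 1, 2, \ldots, l$, are the respective $N \times N$ kernel matrices

$$\mathcal{K}_i := \begin{bmatrix} \kappa_i(x_{i,1}, x_{i,1}) & \cdots & \kappa_i(x_{i,1}, x_{i,N}) \\ \vdots & \ddots & \vdots \\ \kappa_i(x_{i,N}, x_{i,1}) & \cdots & \kappa_i(x_{i,N}, x_{i,N}) \end{bmatrix}.$$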

Observe that (11.99) is a (weighted) version of the group LASSO, defined in Section 10.3. Thus, the
optimization task enforces sparsity by pushing some of the vectors $\boldsymbol{\theta}_i$ to zero values. Any
algorithm developed for the group LASSO can be employed here as well (see, for example, [10,93]).
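The step from the functional cost to the group-LASSO form can be sketched as follows. This is only an outline, assuming (as in the kernel ridge regression derivation) that the norm of a kernel expansion is evaluated through the reproducing property of $\kappa_i$:

```latex
\begin{align*}
f(\boldsymbol{x}_n) &= \sum_{i=1}^{l} h_i(x_{i,n})
  = \sum_{i=1}^{l}\sum_{m=1}^{N} \theta_{i,m}\,\kappa_i(x_{i,n}, x_{i,m})
  = \sum_{i=1}^{l} \big[\mathcal{K}_i \boldsymbol{\theta}_i\big]_n, \\
\|h_i\|_i^2 &= \Big\langle \sum_{n=1}^{N}\theta_{i,n}\kappa_i(\cdot, x_{i,n}),\;
  \sum_{m=1}^{N}\theta_{i,m}\kappa_i(\cdot, x_{i,m}) \Big\rangle
  = \sum_{n=1}^{N}\sum_{m=1}^{N} \theta_{i,n}\theta_{i,m}\,\kappa_i(x_{i,n}, x_{i,m})
  = \boldsymbol{\theta}_i^T \mathcal{K}_i \boldsymbol{\theta}_i .
\end{align*}
```

Stacking the first line over $n = 1, \ldots, N$ turns the data term into a squared Euclidean norm of $\boldsymbol{y} - \sum_i \mathcal{K}_i\boldsymbol{\theta}_i$, while the second line turns each $\|h_i\|_i$ into a weighted Euclidean norm of the group $\boldsymbol{\theta}_i$, which is what makes the penalty act on whole groups of coefficients.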

Besides the squared error loss, other loss functions can also be employed. For example, in [93]
the logistic regression model is also discussed. Moreover, if the separable model in (11.95) cannot
adequately capture the whole structure of $f$, models involving combinations of components can be
considered, such as the ANOVA model (e.g., [67]).

Analysis of variance (ANOVA) is a method in statistics to analyze interactions among variables.
According to this technique, a function $f(\boldsymbol{x})$, $\boldsymbol{x} \in \mathbb{R}^l$, $l > 1$, is
decomposed into a number of terms; each term is given as a sum of functions involving a subset of the
components of $\boldsymbol{x}$. From this point of view, separable functions of the form in (11.95) are a
special case of an ANOVA decomposition. A more general decomposition would be



(11.100)


and this can be generalized to involve a larger number of components. Comparing (11.100) with (11.2),
ANOVA decomposition can be considered as a generalization of the polynomial expansion. The idea of
ANOVA decomposition has already been used to decompose kernels, $\kappa(\boldsymbol{x}, \boldsymbol{y})$, in
terms of other kernels that are functions of a subset of the involved components of $\boldsymbol{x}$ and
$\boldsymbol{y}$, giving rise to the so-called ANOVA kernels [115].

Techniques involving sparse additive kernel-based models have been used in a number of applications,
such as matrix and tensor completion, completion of gene expression, kernel-based dictionary
learning, network flow, and spectrum cartography (see [10] for a related review).


11.15 A CASE STUDY: AUTHORSHIP IDENTIFICATION


Text mining is the part of data mining that analyzes textual data to detect, extract, represent, and
evaluate patterns that appear in texts and can be transformed into real-world knowledge. A few examples
of text mining applications include spam e-mail detection, topic-based classification of texts,
sentiment analysis in texts, text authorship identification, text indexing, and text summarization. In
other words, the focus can be on detecting basic morphology idiosyncrasies in a text that identify its
author (authorship identification), on identifying complexly expressed emotions (sentiment analysis), or
on condensing redundant information from multiple texts to a concise summary (summarization).

In this section, our focus will be on the case of authorship identification (for example, [114]), which
is a special case of text classification. The task is to determine the author of a given text, given a set 
of training texts labeled with their corresponding author. To fulfill this task, one needs to represent and 
act upon textual data. Thus, the first decision related to text mining is how one represents the data. 

It is very common to represent texts in a vector space, following the vector space model (VSM) [97].
According to the VSM, a text document, $T$, is represented by a set of terms $w_i$, $0 < i \le k$,
$k \in \mathbb{N}$, each mapped to a dimension of the vector
$\boldsymbol{t} = [t_1, t_2, \ldots, t_k]^T \in \mathbb{R}^k$. The elements, $t_i$, in each dimension
indicate the importance of the corresponding term $w_i$ when describing the document. For example,
each $w_i$ can be one word in the vocabulary of a language. Thus, in practical applications, $k$ can be very
large, as large as the number of words that are considered sufficient for the specific task of interest.
Typical values of $k$ are of the order of $10^5$.

Widely used approaches to assign values to $t_i$ are the following:

• Bag-of-words, frequency approach: The bag-of-words approach to text relies on the assumption
that a text $T$ is nothing more than a histogram of the words it contains. Thus, in the bag-of-words
assumption, we do not care about the order of the words in the original text. What we care about is
how many times term $w_i$ appears in the document. Thus, $t_i$ is assigned the number of occurrences
of $w_i$ in $T$.

• Bag-of-words, Boolean approach: The Boolean approach to the bag-of-words VSM restricts
the $t_i$ values such that $t_i \in \{0, 1\}$; $t_i$ is assigned a value of 1 if $w_i$ is found at least once
in $T$; otherwise $t_i = 0$.
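The two bag-of-words variants can be illustrated with a minimal Python sketch. The function name `bag_of_words`, the whitespace tokenization, the lowercasing, and the tiny vocabulary are all assumptions made for illustration; a practical system would use a proper tokenizer and a far larger vocabulary.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return (frequency, boolean) VSM vectors of `text` over `vocabulary`."""
    # Assumption: terms are lowercase tokens obtained by whitespace splitting.
    counts = Counter(text.lower().split())
    frequency = [counts[w] for w in vocabulary]        # t_i = occurrences of w_i
    boolean = [1 if c > 0 else 0 for c in frequency]   # t_i in {0, 1}
    return frequency, boolean


vocabulary = ["a", "day", "fine", "night", "today"]
freq, boolean = bag_of_words("A fine day today a fine day", vocabulary)
print(freq)     # [2, 2, 2, 0, 1]
print(boolean)  # [1, 1, 1, 0, 1]
```

Word order plays no role in either vector: any permutation of the input words yields the same result.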

The bag-of-words approaches have proven efficient, for example, in early spam filtering applications,
where individual words were clear indicators of whether an e-mail is spam. However, the spammers
soon adapted and changed the texts they sent out by breaking words using spaces: the word “pills”
would be replaced by “pill s.” The bag-of-words approach then needed to apply preprocessing to
deal with such cases, or even to deal with changes in different word forms (via word stemming or
lemmatization) or spelling mistakes (by using dictionaries to correct the errors).

An alternative representation, with high resilience to noise, is that of the character n-grams. This 
representation is based on character subsequences of a text of length n (also called the order of an 
n-gram). To represent a text using n-grams, we split the text T in (usually contiguous) groups of 
characters of length n, which are then mapped to dimensions in the vector space (bag-of-n-grams). 
Below we give an example on how n-grams can be extracted from a given text. 

Example 11.5. Given the input text T = A_fine_day_today, the character and word 2-grams (n = 2)
would be as follows:

• Unique character 2-grams: “A_”, “_f”, “fi”, “in”, “ne”, “e_”, “_d”, “da”, “ay”, “y_”, “_t”,
“to”, “od”.

• Unique word 2-grams: “A_fine”, “fine_day”, “day_today”.

Observe that “ay” and “da” appear twice in the phrase, yet only once in the sequence, since each
n-gram is uniquely represented. Even though this approach has proven very useful [20], the sequence
of the n-grams is still not exploited.
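The extraction in Example 11.5 can be sketched in a few lines of Python. The helper names (`char_ngrams`, `word_ngrams`, `unique`) are illustrative choices, and the word splitter assumes that words are separated by the underscore used in the example.

```python
def char_ngrams(text, n):
    """All character n-grams of `text`, in order of appearance (with repeats)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n, sep="_"):
    """Word n-grams, assuming words are separated by `sep`."""
    words = text.split(sep)
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def unique(items):
    """Drop repeats while keeping first-occurrence order."""
    return list(dict.fromkeys(items))


text = "A_fine_day_today"
print(unique(char_ngrams(text, 2)))
print(unique(word_ngrams(text, 2)))
```

Sliding the window one character at a time gives 15 character 2-grams for this 16-character string; removing the repeated “da” and “ay” leaves the unique ones.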

In the following, we describe a more complex representation, which takes into account the sequence
of n-grams and at the same time allows for noise. The representation is termed the n-gram graph [46]. An
n-gram graph represents the way n-grams co-occur in a text.

Definition 11.4. If $S = \{S_1, S_2, \ldots\}$, $S_k \neq S_l$ for $k \neq l$, $k, l \in \mathbb{N}$, is the set
of distinct n-grams extracted from a text $T$, and $S_i$ is the $i$th extracted n-gram, then
$G = \{V, E, W\}$ is a graph where $V = S$ is the set of vertices $v$, $E$ is the set of edges $e$ of the form
$e = \{v_1, v_2\}$, and $W: E \to \mathbb{R}$ is a function assigning a weight $w_e$ to every edge.

To generate an n-gram graph, we first extract the unique n-grams from a text, creating one vertex for 
each. Then, given a user-defined parameter distance D, we consider the n-grams that are found within 
a distance D of each other in the text; these are considered neighbors and their respective vertices are 
connected with an edge. For each edge, a weight is assigned. The edge weighting function that is most 
commonly used in n-gram graphs is the number of cooccurrences of the linked (in the graph) n-grams 
in the text. 
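The construction just described can be sketched as follows. The neighborhood convention is an assumption made here for illustration: two n-grams are taken to be neighbors when the later one starts at most D characters after the earlier one, and edges are directed from the earlier to the later n-gram. The name `ngram_graph` is likewise hypothetical and is not the interface of any particular tool.

```python
from collections import Counter


def ngram_graph(text, n=2, distance=2):
    """Directed n-gram graph as a Counter mapping (source, target) -> weight.

    Assumption: each n-gram is linked to the next `distance` n-grams in the
    text; the edge weight counts how many times that co-occurrence happens.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, source in enumerate(grams):
        for target in grams[i + 1:i + 1 + distance]:
            edges[(source, target)] += 1
    return edges


graph = ngram_graph("today", n=2, distance=2)
for (source, target), weight in graph.items():
    print(source, "->", target, weight)
```

The vertices are implicit in the edge keys; storing the graph as an edge-to-weight map is convenient because graph comparisons only need to look up weights of matching edges.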

Two examples of 2-gram graphs, with directed neighborhood edges and $D = 2$, are presented in
Figs. 11.19 and 11.20.

FIGURE 11.19
The 2-gram graph (n = 2), with D = 2, of the string “A fine day.”

FIGURE 11.20
The 2-gram graph (n = 2), with D = 2, of the string “today.”

Between two n-gram graphs, $G^i$ and $G^j$, there exists a symmetric, normalized similarity function
called value similarity (VS). This measure quantifies the ratio of common edges between two graphs,
taking into account the ratio of weights of common edges. In this measure, each matching edge $e$, having
weights $w_e^i$ and $w_e^j$ in graphs $G^i$ and $G^j$, respectively, contributes to VS the amount
$\frac{\mathrm{VR}(e)}{\max(|G^i|, |G^j|)}$, where $|\cdot|$ indicates cardinality with respect to the number
of edges and

$$\mathrm{VR}(e) := \frac{\min(w_e^i, w_e^j)}{\max(w_e^i, w_e^j)}. \tag{11.101}$$

Non-matching edges make no contribution (for an edge $e \notin G^i$ ($e \notin G^j$) we define
$w_e^i = 0$ ($w_e^j = 0$)). The equation indicates that the value ratio (VR) takes values in $[0, 1]$ and is
symmetric. Thus, the full equation for the VS is

$$\mathrm{VS}(G^i, G^j) = \frac{\sum_{e \in G^i} \frac{\min(w_e^i, w_e^j)}{\max(w_e^i, w_e^j)}}{\max(|G^i|, |G^j|)}. \tag{11.102}$$

The VS takes a value close to 1 for graphs that share many edges with similar weights. A value of VS = 1
indicates a perfect match between the compared graphs. For the two graphs shown in Figs. 11.19 and
11.20, the calculation of the VS returns a value of VS = 0.067 (verify it).
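The suggested verification can be attempted with a short sketch. The compact graph builder is included only to keep the example self-contained; it assumes that each n-gram is linked, with a directed edge, to the next D n-grams in the text. The names `ngram_graph` and `value_similarity` are illustrative.

```python
from collections import Counter


def ngram_graph(text, n=2, distance=2):
    # Assumed convention: connect each n-gram to the next `distance` n-grams.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, source in enumerate(grams):
        for target in grams[i + 1:i + 1 + distance]:
            edges[(source, target)] += 1
    return edges


def value_similarity(g1, g2):
    """Sum of value ratios over the edges of g1, divided by the larger edge count."""
    if not g1 or not g2:
        return 0.0
    total = 0.0
    for edge, w1 in g1.items():
        w2 = g2.get(edge, 0)               # a missing edge has weight 0
        total += min(w1, w2) / max(w1, w2)
    return total / max(len(g1), len(g2))


g_a = ngram_graph("A fine day")
g_b = ngram_graph("today")
print(round(value_similarity(g_a, g_b), 3))  # 0.067 under the assumed convention
```

Under this convention the first graph has 15 edges, the second has 5, and the only edge they share is the one from “da” to “ay” with equal weights, so the similarity is 1/15.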

For our case study, the VS will be used to build an SVM kernel classifier, in order to assign (classify) 
texts to their authors. We should stress here that the VS function is a similarity function, but not a 
kernel. Fortunately, the VS on the specific data set is an $(\epsilon, \gamma)$-good similarity function as defined in
[6, Theorem 3], and hence this allows its use as a kernel function, in line with what we have already 
said about string kernels in Section 11.5.2. 

It is interesting to note that given the VSM representation of a text, one can also employ any 
vector-based kernel on strings. However, there have been several works that use string kernels [69] to 
avoid any loss of information in the transformation between the original string and its VSM equivalent. 
In this example, we demonstrate how one can combine even a complex representation with the SVM 
via an appropriate kernel function. 

The data set we used to examine the effectiveness of this combination is a subset of the 
Reuter_50_50 Data Set from the UCI repository [5]. This data set contains 2500 training and 2500 
test texts from 50 different authors (50 training and 50 test texts per author). 

• First, we used the ngg2svm command line tool to represent the strings as n-gram graphs and
generate a precomputed kernel matrix file. By default, the tool extracts 3-gram graphs (n = 3,
D = 3) from the training texts. It then uses the VS to estimate a kernel and calculates the kernel values
over all pairs of training instances into a kernel matrix. The matrix is the output of this phase.

• Given the kernel matrix, we can use the LibSVM software [24] to design a classifier, without caring
about the original texts, relying only on the precomputed kernel values.

Then, a 10-fold cross-validation over the data was performed, using the “svmtrain” program in
LibSVM. The achieved accuracy of the cross-validation in our example was 94%.
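To make the notion of a precomputed kernel file concrete, the sketch below writes a small similarity matrix in the text format that LIBSVM documents for precomputed kernels: each line holds the label, a serial number under index 0, and then the kernel values against every training instance. The function name and the toy numbers are hypothetical, and this is not claimed to be the exact output of ngg2svm.

```python
def to_libsvm_precomputed(labels, kernel_matrix):
    """Format a square kernel matrix as LIBSVM 'precomputed kernel' lines.

    Each line reads: <label> 0:<serial> 1:K(i,1) ... N:K(i,N), serials from 1.
    """
    lines = []
    for i, (label, row) in enumerate(zip(labels, kernel_matrix), start=1):
        values = " ".join(f"{j}:{k:.6g}" for j, k in enumerate(row, start=1))
        lines.append(f"{label} 0:{i} {values}")
    return lines


# Toy similarity values standing in for VS evaluations (hypothetical numbers).
labels = [1, 1, 2]
kernel = [[1.0, 0.4, 0.1],
          [0.4, 1.0, 0.2],
          [0.1, 0.2, 1.0]]
for line in to_libsvm_precomputed(labels, kernel):
    print(line)
```

A file in this format can be passed to LIBSVM's training program with the precomputed kernel type (`-t 4`) and the cross-validation option (`-v 10`), so the classifier never needs access to the original texts.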


15 You can download the tool from https://github.com/ggianna/ngg2svm.


PROBLEMS

11.1 Derive the formula for the number of groupings $\mathcal{O}(N, l)$ in Cover’s theorem.
Hint: Show first the following recursion:

$$\mathcal{O}(N + 1, l) = \mathcal{O}(N, l) + \mathcal{O}(N, l - 1).$$

To this end, start with $N$ points and add an extra one. Show that the extra number of linear
dichotomies is solely due to those, for the $N$ data point case, which could be drawn via the new
point.

11.2 Show that if $N = 2(l + 1)$, the number of linear dichotomies in Cover’s theorem is equal to
$2^{2l+1}$.
Hint: Use the identity

$$\sum_{i=0}^{j} \binom{j}{i} = 2^j$$

and recall that

$$\binom{2n + 1}{n - i + 1} = \binom{2n + 1}{n + i}.$$

11.3 Show that the reproducing kernel is a positive definite one.

11.4 Show that if $\kappa(\cdot, \cdot)$ is the reproducing kernel in an RKHS $H$, then

$$H = \operatorname{span}\{\kappa(\cdot, \boldsymbol{x}),\ \boldsymbol{x} \in \mathcal{X}\}.$$

11.5 Show the Cauchy-Schwarz inequality for kernels, that is,

$$|\kappa(\boldsymbol{x}, \boldsymbol{y})|^2 \le \kappa(\boldsymbol{x}, \boldsymbol{x})\, \kappa(\boldsymbol{y}, \boldsymbol{y}).$$

11.6 Show that if

$$\kappa_i(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \longmapsto \mathbb{R}, \quad i = 1, 2,$$

are kernels, then

• $\kappa(\boldsymbol{x}, \boldsymbol{y}) = \kappa_1(\boldsymbol{x}, \boldsymbol{y}) + \kappa_2(\boldsymbol{x}, \boldsymbol{y})$ is also a kernel;

• $a\kappa(\boldsymbol{x}, \boldsymbol{y})$, $a > 0$, is also a kernel;

• $\kappa(\boldsymbol{x}, \boldsymbol{y}) = \kappa_1(\boldsymbol{x}, \boldsymbol{y})\, \kappa_2(\boldsymbol{x}, \boldsymbol{y})$ is also a kernel.

11.7 Derive Eq. (11.25).

11.8 Show that the solution for the parameters $\hat{\boldsymbol{\theta}}$ for the kernel ridge regression, if a bias term $b$ is
present, is given by

$$\begin{bmatrix} \mathcal{K} + CI & \boldsymbol{1} \\ \boldsymbol{1}^T \mathcal{K} & N \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\theta}} \\ \hat{b} \end{bmatrix} = \begin{bmatrix} \boldsymbol{y} \\ \boldsymbol{y}^T \boldsymbol{1} \end{bmatrix},$$

where $\boldsymbol{1}$ is the vector with all its elements being equal to one. Invertibility of the kernel matrix
has been assumed.

11.9 Derive Eq. (11.56).

11.10 Derive the dual cost function associated with the linear $\epsilon$-insensitive loss function.

11.11 Derive the dual cost function for the separable class SVM formulation.

11.12 Derive the kernel approximation in Eq. (11.91).

11.13 Derive the subgradient for the Huber loss function.
MATLAB® EXERCISES

11.14 Consider the regression problem described in Example 11.2. Read an audio file using
MATLAB®’s wavread function and take 100 data samples (use Blade Runner, if possible,
and take 100 samples starting from the 100,000th sample). Then add white Gaussian noise at
a 15 dB level and randomly “hit” 10 of the data samples with outliers (set the outlier values to
80% of the maximum value of the data samples).

(a) Find the reconstructed data samples using the unbiased kernel ridge regression method,
that is, Eq. (11.27). Employ the Gaussian kernel with $\sigma = 0.004$ and set $C = 0.0001$. Plot
the fitted curve of the reconstructed samples together with the data used for training.

(b) Find the reconstructed data samples using the biased kernel ridge regression method
(employ the same parameters as for the unbiased case). Plot the fitted curve of the
reconstructed samples together with the data used for training.

(c) Repeat steps (a) and (b) using $C = 10^{-6}$, $10^{-5}$, 0.0005, 0.001, 0.01, and 0.05.

(d) Repeat steps (a) and (b) using $\sigma = 0.001$, 0.003, 0.008, 0.01, and 0.05.

(e) Comment on the results.

11.15 Consider the regression problem described in Example 11.2. Read the same audio file as in
Problem 11.14. Then add white Gaussian noise at a 15 dB level and randomly “hit” 10 of the
data samples with outliers (set the outlier values to 80% of the maximum value of the data
samples).

(a) Find the reconstructed data samples obtained by the support vector regression (you can
use libsvm$^{16}$ for training). Employ the Gaussian kernel with $\sigma = 0.004$ and set $\epsilon = 0.003$
and $C = 1$. Plot the fitted curve of the reconstructed samples together with the data used
for training.

(b) Repeat step (a) using $C = 0.05$, 0.1, 0.5, 5, 10, and 100.

(c) Repeat step (a) using $\epsilon = 0.0005$, 0.001, 0.01, 0.05, and 0.1.

(d) Repeat step (a) using $\sigma = 0.001$, 0.002, 0.01, 0.05, and 0.1.

(e) Comment on the results.

11.16 Consider the two-class two-dimensional classification task described in the SVM example in the
book and slides. The data set comprises $N = 150$ points, $\boldsymbol{x}_n = [x_{n,1}, x_{n,2}]^T$, $n = 1, 2, \ldots, N$,
uniformly distributed in $[-5, 5] \times [-5, 5]$. For each point, compute

$$y_n = 0.05 x_{n,1}^3 + 0.05 x_{n,1}^2 + 0.05 x_{n,1} + 0.05 + \eta,$$

where $\eta$ denotes zero mean Gaussian noise of variance $\sigma_\eta^2 = 4$. Then, if $x_{n,2} \ge y_n$, assign it to
class $\omega_1$, and if $x_{n,2} < y_n$, assign it to class $\omega_2$.

(a) Plot the points $[x_{n,1}, x_{n,2}]$ using different colors for each class.

(b) Obtain the SVM classifier using libsvm or any other related MATLAB® package. Use
the Gaussian kernel with $\sigma = 20$ and set $C = 1$. Plot the classifier and the margin (for the
latter you can employ MATLAB®’s contour function). Moreover, find the support vectors
(i.e., the points with nonzero Lagrange multipliers that contribute to the expansion of the
classifier) and plot them as circled points.

(c) Repeat step (b) using $C = 0.5$, 0.1, and 0.05.

(d) Repeat step (b) using $C = 5$, 10, 50, and 100.

(e) Comment on the results.

16 http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

11.17 Consider the authorship identification problem described in Section 11.15. 

(a) Using the training texts of the authors whose names start with “T” in the data set, perform 
10-fold cross-validation using the n-gram graph representation of the texts and the VS as 
a kernel function. 

(b) Using the same subset of the data set, create the vector space model representation of the
texts and apply classification using a Gaussian kernel. Compare the results with the results
of (a).
12.1 INTRODUCTION

The Bayesian approach to parameter inference was introduced in Chapter 3. Compared to other
methods for parameter estimation we have covered, the Bayesian method adopts a radically different
viewpoint. The unknown set of parameters is treated as a set of random variables instead of as a set of
fixed (yet unknown) values. This was a revolutionary idea at the time it was introduced by Bayes and
later on by Laplace, as pointed out in Chapter 3. Even now, after more than two centuries, it may seem
strange to assume that a physical phenomenon/mechanism is controlled by a set of random parameters.
However, there is a subtle point here. Treating the underlying set of parameters as random variables,
$\boldsymbol{\theta}$, we do not really imply a random nature for them. The associated randomness, in terms of the
prior distribution $p(\boldsymbol{\theta})$, encapsulates our uncertainty about their values, prior to receiving any
measurements/observations. Stated differently, the prior distribution represents our belief about the
different possible values, although only one of them is actually true. From this perspective,
probabilities are viewed in a more open-minded way, that is, as measures of uncertainty, as discussed in
the beginning of Chapter 2.

Recall that parameter learning from data is an inverse problem. Basically, all we do is deduce the
“causes” (parameters) from the “effects” (observations). The Bayes theorem can be seen as an inversion
procedure expressed in a probabilistic context. Indeed, given the set of observations, say, $\mathcal{X}$, which are
controlled by the unknown set of parameters, we write

$$p(\boldsymbol{\theta} | \mathcal{X}) = \frac{p(\mathcal{X} | \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{X})}.$$

All that is needed for the above inversion is to have a guess about $p(\boldsymbol{\theta})$. This term has brought a lot of
controversy in the statistical community for a number of years. However, once a reasonable guess of
the prior is available, a number of advantages associated with the Bayesian approach emerge, compared
to the alternative route. The latter embraces methods that view the parameters deterministically as
constants of unknown values, and they are also referred to as frequentist techniques. The term comes from
the more classical view of probabilities as frequencies of occurrence of repeatable events. A typical
example of this family of methods is the maximum likelihood approach, which estimates the values of
the parameters by maximizing $p(\mathcal{X} | \boldsymbol{\theta})$; the value of the latter conditional probability density function
(PDF) is solely controlled by the obtained observations in a sequence of experiments.

This is the first of two chapters dedicated to Bayesian learning. We present the main concepts and 
philosophy behind Bayesian inference. We introduce the expectation-maximization (EM) algorithm 
and apply it in some typical machine learning parametric modeling tasks, such as regression, mixture 
modeling, and mixture of experts. Finally, the exponential family of distributions is introduced and the 
notion of conjugate priors is discussed. 


12.2 REGRESSION: A BAYESIAN PERSPECTIVE 


The Bayesian inference treatment of the linear regression task was introduced in Chapter 3. In the
current chapter, we go beyond the basic definitions and reveal and exploit various possibilities that the
Bayesian philosophy offers to the study of this important machine learning task. Let us first summarize
the findings of Chapter 3 and then start building upon them.

Recall the (generalized) linear regression task, as it was introduced in previous chapters, that is,

$$\mathrm{y} = \theta_0 + \sum_{k=1}^{K-1} \theta_k \phi_k(\mathbf{x}) + \eta = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}) + \eta, \tag{12.1}$$

where $\mathrm{y} \in \mathbb{R}$ is the output random variable, $\mathbf{x} \in \mathbb{R}^l$ is the input random vector, $\eta \in \mathbb{R}$ is the noise
disturbance, $\boldsymbol{\theta} \in \mathbb{R}^K$ is the unknown parameter vector, and

$$\boldsymbol{\phi}(\mathbf{x}) := [\phi_1(\mathbf{x}), \ldots, \phi_{K-1}(\mathbf{x}), 1]^T,$$

where $\phi_k$, $k = 1, \ldots, K - 1$, are some (fixed) basis functions. As we already know, typical examples
of such functions can be the Gaussian function, splines, monomials, and others. We are given a set of
$N$ output-input training points, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$. In our current setting, we assume that the
respective (unobserved) noise values $\eta_n$, $n = 1, 2, \ldots, N$, are samples of jointly Gaussian distributed
random variables with covariance matrix $\Sigma_\eta$, that is,

$$p(\boldsymbol{\eta}) = \mathcal{N}(\boldsymbol{\eta} | \boldsymbol{0}, \Sigma_\eta), \tag{12.2}$$

where $\boldsymbol{\eta} = [\eta_1, \eta_2, \ldots, \eta_N]^T$.

1 Recall our adopted notation: random variables and vectors are denoted with roman and their respective measured
values/observations with Times Roman fonts.


12.2.1 THE MAXIMUM LIKELIHOOD ESTIMATOR


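The maximum likelihood (ML) method was introduced in Chapter 3. According to the method, the
unknown set of parameters is treated as a deterministic vector variable, $\boldsymbol{\theta}$, which parameterizes the
PDF that describes the output vector of observations

$$\boldsymbol{y} = \Phi \boldsymbol{\theta} + \boldsymbol{\eta}, \tag{12.3}$$

where

$$\Phi = \begin{bmatrix} \boldsymbol{\phi}^T(\boldsymbol{x}_1) \\ \boldsymbol{\phi}^T(\boldsymbol{x}_2) \\ \vdots \\ \boldsymbol{\phi}^T(\boldsymbol{x}_N) \end{bmatrix} \tag{12.4}$$

and $\boldsymbol{y} = [y_1, y_2, \ldots, y_N]^T$. A simple replacement of $X$ with $\Phi$ in (3.61) changes the ML estimate to

$$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \left(\Phi^T \Sigma_\eta^{-1} \Phi\right)^{-1} \Phi^T \Sigma_\eta^{-1} \boldsymbol{y}. \tag{12.5}$$

For the simple case of uncorrelated noise samples of equal variance $\sigma_\eta^2$ ($\Sigma_\eta = \sigma_\eta^2 I$), Eq. (12.5) becomes
identical to the least-squares (LS) solution

$$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \left(\Phi^T \Phi\right)^{-1} \Phi^T \boldsymbol{y} = \hat{\boldsymbol{\theta}}_{\mathrm{LS}}. \tag{12.6}$$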


A major drawback of the ML approach is that it is vulnerable to overfitting, because no care is taken of 
complex models that try to “learn” the specificities of the particular training set, as already discussed 
in Chapter 3. 
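12.2.2 THE MAP ESTIMATOR

According to the maximum a posteriori probability (MAP) method, the unknown set of parameters is
treated as a random vector $\boldsymbol{\theta}$ and its posterior, for a given set of output observations, $\boldsymbol{y}$, is expressed as

$$p(\boldsymbol{\theta} | \boldsymbol{y}) = \frac{p(\boldsymbol{y} | \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\boldsymbol{y})}, \tag{12.7}$$

where $p(\boldsymbol{\theta})$ is the associated prior PDF. We have eliminated from the notation the dependence on $\mathcal{X}$, to
make it look simpler. We emphasize that the input set, $\mathcal{X} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$, is considered fixed, so all the
randomness associated with $\boldsymbol{y}$ is due to the noise source. Assuming both the prior and the conditional
PDFs to be Gaussians, that is,

$$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} | \boldsymbol{\theta}_0, \Sigma_\theta) \tag{12.8}$$

and

$$p(\boldsymbol{y} | \boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{y} | \Phi \boldsymbol{\theta}, \Sigma_\eta), \tag{12.9}$$

where (12.2) and (12.3) have been used, the posterior $p(\boldsymbol{\theta} | \boldsymbol{y})$ turns out also to be Gaussian with mean
vector

$$\boldsymbol{\mu}_{\theta|y} := \mathbb{E}[\boldsymbol{\theta} | \boldsymbol{y}] = \boldsymbol{\theta}_0 + \left(\Sigma_\theta^{-1} + \Phi^T \Sigma_\eta^{-1} \Phi\right)^{-1} \Phi^T \Sigma_\eta^{-1} (\boldsymbol{y} - \Phi \boldsymbol{\theta}_0). \tag{12.10}$$

Because the maximum of a Gaussian coincides with its mean, we have

$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \mathbb{E}[\boldsymbol{\theta} | \boldsymbol{y}]. \tag{12.11}$$

In the present chapter’s appendix, an analytical proof of (12.10) is provided. It suffices to replace
in (12.143) $\boldsymbol{t} \to \boldsymbol{y}$, $\boldsymbol{z} \to \boldsymbol{\theta}$, $A \to \Phi$, $\Sigma_{t|z} \to \Sigma_\eta$, and $\Sigma_z \to \Sigma_\theta$. Note that the MAP estimate is a
regularized version of $\hat{\boldsymbol{\theta}}_{\mathrm{ML}}$. Regularization is achieved via $\boldsymbol{\theta}_0$ and $\Sigma_\theta$, which are imposed by the prior
$p(\boldsymbol{\theta})$. If one assumes $\Sigma_\theta = \sigma_\theta^2 I$, $\Sigma_\eta = \sigma_\eta^2 I$, and $\boldsymbol{\theta}_0 = \boldsymbol{0}$, then (12.10) coincides with the solution of the
regularized LS (ridge) regression,

$$\hat{\boldsymbol{\theta}}_{\mathrm{MAP}} = \left(\lambda I + \Phi^T \Phi\right)^{-1} \Phi^T \boldsymbol{y}, \tag{12.12}$$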


where we have set $\lambda := \sigma_\eta^2 / \sigma_\theta^2$. We already know from Chapter 3 that the value of $\lambda$ is critical to the
performance of the estimator with respect to the mean-square error (MSE) performance. The main issue
now becomes how to choose a good value for $\lambda$, or equivalently for $\Sigma_\theta$, $\Sigma_\eta$ in the more general case.
In practice, the cross-validation method (Chapter 3) is employed; different values of $\lambda$ are tested and
the one that leads to the best MSE (or some other criterion) is selected. However, this is a
computationally costly procedure, especially for complex models, where a large number of parameters is involved.
Moreover, such a procedure forces us to use only a fraction of the available data for training, to reserve
the rest for testing. The reader may wonder why we do not use the training data to optimize with respect
to both the unknown parameter vector $\boldsymbol{\theta}$ and the regularization parameter. Let us consider as an
example the simpler case of ridge regression, for centered data ($\theta_0 = 0$). The cost function comprises two
terms, one that is data dependent and measures the misfit, and one that depends only on the unknown
parameters,

$$J(\boldsymbol{\theta}, \lambda) = \|\boldsymbol{y} - \Phi \boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|^2. \tag{12.13}$$

It is obvious that the only value of $\lambda$ that leads to the minimum squared error fit over the training data
set (empirical loss) is $\lambda = 0$. Any other value of $\lambda$ would result in an estimate of $\boldsymbol{\theta}$ which scores larger
values of the squared error term; this is natural, because for $\lambda \neq 0$ the optimization has to take care of
the extra regularizing term, too. It is only when test data sets are employed that values of $\lambda \neq 0$ lead
to an overall decrease of the MSE (not the empirical one).

2 Because in this chapter many random variables will be involved, we explicitly state the name of the variable to which we refer
in $\mathcal{N}(\cdot, \cdot)$.

3 Because the appendix serves the needs of various parts of the book, each time involving different variables, one has to make
the necessary notational substitutions. Note that the appendix can be downloaded from the book’s site.

4 Recall from Section 3.8 that this is valid if either the data have been centered or the intercept (bias) is involved in the
regularizing norm term.
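The argument can be checked numerically with a minimal sketch for a single parameter, where the ridge minimizer has a closed form. The data values below are hypothetical and chosen only for illustration; the behavior of the held-out error depends on the particular data, whereas the behavior of the training error does not.

```python
def ridge_scalar(xs, ys, lam):
    """Minimizer of sum (y - theta*x)^2 + lam*theta^2 for a single parameter."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def squared_error(xs, ys, theta):
    """Sum of squared residuals of the fit y = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical training samples (noisy) and test samples (slope 2).
x_train, y_train = [1.0, 2.0, 3.0], [2.9, 3.6, 6.8]
x_test, y_test = [1.5, 2.5], [3.0, 5.0]

for lam in [0.0, 0.5, 1.0, 2.0]:
    theta = ridge_scalar(x_train, y_train, lam)
    print(lam,
          round(squared_error(x_train, y_train, theta), 4),
          round(squared_error(x_test, y_test, theta), 4))
```

The training error grows with every increase of the regularization parameter, so minimizing it jointly over both arguments always returns zero regularization. On this toy test set, however, a nonzero value gives a smaller error than the unregularized fit.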


12.2.3 THE BAYESIAN APPROACH 

The Bayesian approach to regression attempts to overcome the previously reported drawbacks, which 
are associated with the overfitting. All the involved parameters can be estimated on the training set. 
In this vein, the parameters will be treated as random variables. At the same time, because the main 
task now becomes that of inferring the PDF that describes the unknown set of parameters, instead of 
obtaining a single vector estimate, one has more information at her/his disposal. Having said that, it 
does not mean that Bayesian techniques are necessarily free from cross-validation; this will be needed 
to assess their overall performance. We will comment further on this in the Remarks at the end of 
Section 12.3. 

As we know, the starting point is the same as that for MAP, and in particular (12.7). However,
instead of taking just the maximum of the numerator in (12.7), we will make use of $p(\boldsymbol{\theta} | \boldsymbol{y})$ as a whole.
Most of the secrets here lie in the denominator $p(\boldsymbol{y})$, which is basically the normalizing constant,

$$p(\boldsymbol{y}) = \int p(\boldsymbol{y} | \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}. \tag{12.14}$$

As we will soon see, there is much more information hidden in $p(\boldsymbol{y})$ that goes beyond the need of just
computing $p(\boldsymbol{\theta} | \boldsymbol{y})$. The difficulty with (12.14) is that, in general, the evaluation of the integral cannot
be performed analytically. In such cases, one has to resort to approximate techniques to obtain the
required information. To this end, a number of approaches are available, and a large part of this book
is dedicated to their study. More specifically, the following methods have been proposed and will be
considered:

• the Laplacian approximation method, presented in Section 12.3; 

• the variational approximation method, presented in Section 13.2; 

• the variational bound approximation method, presented in Section 13.8; 

• Monte Carlo techniques for the evaluation of the integral, which are discussed in Chapter 14; 

• message passing algorithms, to be discussed in Chapter 15. 
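For the case under study in this section, where $p(\boldsymbol{y} | \boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$ are both assumed to be Gaussians,
$p(\boldsymbol{y})$ can be evaluated analytically; it turns out that the joint distribution $p(\boldsymbol{y}, \boldsymbol{\theta})$ is also Gaussian and
hence the marginal $p(\boldsymbol{y})$ is Gaussian as well. All these are shown in detail in the appendix of this
chapter. Indeed, if we set in (12.146) and (12.151) of the appendix $\boldsymbol{z} \to \boldsymbol{\theta}$, $\boldsymbol{t} \to \boldsymbol{y}$, and $A \to \Phi$, it turns
out that for the regression model of (12.3) and the prior PDF in (12.8) as well as the noise model of
(12.2), we obtain

$$p(\boldsymbol{y}) = \mathcal{N}\left(\boldsymbol{y} | \Phi \boldsymbol{\theta}_0, \Sigma_\eta + \Phi \Sigma_\theta \Phi^T\right). \tag{12.15}$$

Moreover, the posterior $p(\boldsymbol{\theta} | \boldsymbol{y})$ is also Gaussian,

$$p(\boldsymbol{\theta} | \boldsymbol{y}) = \mathcal{N}\left(\boldsymbol{\theta} | \boldsymbol{\mu}_{\theta|y}, \Sigma_{\theta|y}\right), \tag{12.16}$$

where $\boldsymbol{\mu}_{\theta|y}$ is given by (12.10) and the covariance matrix results from (12.147), after the appropriate
notational substitutions, that is,

$$\Sigma_{\theta|y} = \left(\Sigma_\theta^{-1} + \Phi^T \Sigma_\eta^{-1} \Phi\right)^{-1}. \tag{12.17}$$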

The posterior PDF in (12.16) encapsulates our knowledge about $\boldsymbol{\theta}$, after the observations $\boldsymbol{y}$ have been
obtained. Hence, our uncertainty about $\boldsymbol{\theta}$ has been reduced, which is the main reason that (12.16) is
different from the prior PDF in (12.8); the latter represents only our initial guess. The covariance matrix
in (12.17) provides the information about our uncertainty with respect to $\boldsymbol{\theta}$. If the Gaussian in (12.16) is
very broad around its mean $\boldsymbol{\mu}_{\theta|y}$, it indicates that in spite of the reception of the observations still much
uncertainty about $\boldsymbol{\theta}$ remains. This can be due (a) to the nature of the problem, for example, high noise
variance, as this is conveyed by $\Sigma_\eta$, (b) to the number of observations, $N$, which may not be enough,
and/or (c) to modeling inaccuracies, as this is conveyed by $\Phi$ in (12.17). The opposite comments are
in order if the posterior PDF is sharply peaked around its mean.

As we have already stated in Chapter 3, the Bayesian philosophy provides the means for a direct
inference of the output variable, which in many applications is the quantity of interest; given the input
vector, the task is to predict the output. In such cases, estimating a value for the unknown $\boldsymbol{\theta}$ is only the
means to an end. To formulate the prediction task directly, without involving $\boldsymbol{\theta}$, one has to integrate out the
contribution of $\boldsymbol{\theta}$. Having learned the posterior $p(\boldsymbol{\theta} | \boldsymbol{y})$, given a new input vector $\boldsymbol{x}$, for the regression
model in (12.1), the conditional PDF of the output variable, $y$, given the set of observations, is written
as

$$p(y | \boldsymbol{x}, \boldsymbol{y}) = \int p(y | \boldsymbol{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} | \boldsymbol{y})\, d\boldsymbol{\theta}. \tag{12.18}$$

Note that we have used $p(y | \boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\theta}) = p(y | \boldsymbol{x}, \boldsymbol{\theta})$ because $y$ is conditionally independent of $\boldsymbol{y}$ given the
value of $\boldsymbol{\theta}$. As already stated, strictly speaking, the posterior should have been denoted as $p(\boldsymbol{\theta} | \boldsymbol{y}; \mathcal{X})$
to indicate the dependence on the input training samples. However, the dependence on $\mathcal{X}$ has been
suppressed to unclutter notation.

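In the sequel, and in order to simplify algebra and focus on the concepts, we assume that the noise
model in (12.2) is such that $\Sigma_\eta = \sigma_\eta^2 I$ and also $\Sigma_\theta = \sigma_\theta^2 I$ for the prior PDF in (12.8). Then we have

$$p(y|\boldsymbol{x}, \boldsymbol{\theta}) = \mathcal{N}\left(y\,|\,\boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}), \sigma_\eta^2\right),$$

and (12.17) and (12.10) for the posterior covariance matrix and mean, respectively, become

$$\Sigma_{\theta|y} = \left(\frac{1}{\sigma_\theta^2} I + \frac{1}{\sigma_\eta^2} \Phi^T \Phi\right)^{-1}, \tag{12.19}$$

$$\boldsymbol{\mu}_{\theta|y} = \boldsymbol{\theta}_0 + \frac{1}{\sigma_\eta^2} \Sigma_{\theta|y} \Phi^T \left(\boldsymbol{y} - \Phi \boldsymbol{\theta}_0\right). \tag{12.20}$$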


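The integration in (12.18) can now be carried out analytically as in (12.136) and (12.137) and
using (12.150) and (12.151) in the appendix, with $\boldsymbol{z} \rightarrow \boldsymbol{\theta}$, $\boldsymbol{t} \rightarrow y$, $A \rightarrow \boldsymbol{\phi}^T(\boldsymbol{x})$,
$\boldsymbol{\mu}_z \rightarrow \boldsymbol{\mu}_{\theta|y}$, $\Sigma_z \rightarrow \Sigma_{\theta|y}$, $\Sigma_{t|z} \rightarrow \sigma_\eta^2$, and we obtain

$$p(y|\boldsymbol{x}, \boldsymbol{y}) = \mathcal{N}\left(y\,|\,\mu_y, \sigma_y^2\right), \tag{12.21}$$

where

$$\mu_y = \boldsymbol{\phi}^T(\boldsymbol{x})\, \boldsymbol{\mu}_{\theta|y}, \tag{12.22}$$

$$\sigma_y^2 = \sigma_\eta^2 + \boldsymbol{\phi}^T(\boldsymbol{x})\, \Sigma_{\theta|y}\, \boldsymbol{\phi}(\boldsymbol{x}). \tag{12.23}$$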


Hence, given $\boldsymbol{x}$ one can predict the respective value of $y$ using the most probable value, that is, $\mu_y$ in
(12.22). Note that the same prediction value would result via the MAP estimate in (12.10) (or (12.12),
if $\boldsymbol{\theta}_0 = \boldsymbol{0}$, also obtained via the ridge regression task). Have we then gained anything extra by adopting
the Bayesian approach? The answer is in the affirmative. More information concerning the predicted
value is now available, because (12.23) quantifies the associated uncertainty.
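The prediction step can be sketched numerically as follows. This is a minimal illustration, assuming the isotropic covariances $\Sigma_\eta = \sigma_\eta^2 I$ and $\Sigma_\theta = \sigma_\theta^2 I$; the function names `posterior` and `predict`, the straight-line toy data, and all numerical values are hypothetical choices made for this example and do not come from the text.

```python
import numpy as np

def posterior(Phi, y, theta0, sigma_eta2, sigma_theta2):
    """Posterior mean and covariance of theta: (12.10) and (12.17) with isotropic covariances."""
    K = Phi.shape[1]
    Sigma_post = np.linalg.inv(np.eye(K) / sigma_theta2 + Phi.T @ Phi / sigma_eta2)
    mu_post = theta0 + Sigma_post @ Phi.T @ (y - Phi @ theta0) / sigma_eta2
    return mu_post, Sigma_post

def predict(phi_x, mu_post, Sigma_post, sigma_eta2):
    """Predictive mean (12.22) and variance (12.23) for a single feature vector phi(x)."""
    mu_y = phi_x @ mu_post
    sigma_y2 = sigma_eta2 + phi_x @ Sigma_post @ phi_x
    return mu_y, sigma_y2

# Hypothetical toy data: the line y = 1 + 2x observed in Gaussian noise of variance 0.05.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x])      # basis functions [1, x]
y = 1.0 + 2.0 * x + rng.normal(0.0, np.sqrt(0.05), size=x.size)

mu_post, Sigma_post = posterior(Phi, y, np.zeros(2), sigma_eta2=0.05, sigma_theta2=10.0)
# Predict at x = 0.5, whose feature vector is [1, 0.5].
mu_y, sigma_y2 = predict(np.array([1.0, 0.5]), mu_post, Sigma_post, sigma_eta2=0.05)
print(mu_y, sigma_y2)
```

The predictive variance returned here is never smaller than the noise variance, because the second term in (12.23) is a nonnegative quadratic form; with a zero prior mean, the posterior mean coincides with the ridge regression solution for the regularization value $\sigma_\eta^2/\sigma_\theta^2$.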

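To investigate (12.23) further, let us simplify it by adopting the following approximation:

$$R_\phi := \mathbb{E}\left[\boldsymbol{\phi}(\boldsymbol{x}) \boldsymbol{\phi}^T(\boldsymbol{x})\right] \approx \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\phi}(\boldsymbol{x}_n) \boldsymbol{\phi}^T(\boldsymbol{x}_n) = \frac{1}{N} \Phi^T \Phi,$$

or

$$\Phi^T \Phi \approx N R_\phi, \tag{12.24}$$

where $R_\phi$ is the autocorrelation matrix of the random vector $\boldsymbol{\phi}(\boldsymbol{x})$. Substituting (12.24) into (12.23)
leads to

$$\sigma_y^2 \approx \sigma_\eta^2 \left(1 + \sigma_\theta^2\, \boldsymbol{\phi}^T(\boldsymbol{x}) \left(\sigma_\eta^2 I + N \sigma_\theta^2 R_\phi\right)^{-1} \boldsymbol{\phi}(\boldsymbol{x})\right), \tag{12.25}$$

which for large enough $N$ becomes

$$\sigma_y^2 \approx \sigma_\eta^2 \left(1 + \frac{1}{N} \boldsymbol{\phi}^T(\boldsymbol{x}) R_\phi^{-1} \boldsymbol{\phi}(\boldsymbol{x})\right).$$

Thus, for a large number of observations, $\sigma_y^2 \rightarrow \sigma_\eta^2$, and our uncertainty is contributed by the noise
source, which cannot be reduced anymore. For smaller values of $N$, there is extra uncertainty associated
with the parameter $\boldsymbol{\theta}$, measured by $\sigma_\theta^2$ in (12.25).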

So far in this section, we dealt with Gaussians, which led to tractable and analytically computed
integrals. Are there ways to attack more general cases? Moreover, even in the case of Gaussian PDFs,
we have assumed the covariance matrices $\Sigma_\theta$, $\Sigma_\eta$ to be known. In practice, they are not. Even if one
assumes that $\Sigma_\eta$ can be experimentally measured, there still remains $\Sigma_\theta$. Can one select the related
parameters via an optimization process? If the answer is yes, can this optimization be carried out on
the training set, or would one necessarily run into problems similar to the ones we faced with the
regularization approach? We will address all these challenges in the sections to follow.

Remarks 12.1. 

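• The MAP estimator is sometimes referred to as Type I estimator, to be distinguished from the Type
II estimation method, which will be discussed in Remarks 12.2, in the next section.

• The posterior mean in (12.10) can be met in different variants, which are obtained via the application
of the matrix inversion lemmas given in Appendix A.1. In the chapter’s appendix, it is shown that
(Eq. (12.152))

$$\boldsymbol{\mu}_{\theta|y} = \left(\Sigma_\theta^{-1} + \Phi^T \Sigma_\eta^{-1} \Phi\right)^{-1} \left(\Phi^T \Sigma_\eta^{-1} \boldsymbol{y} + \Sigma_\theta^{-1} \boldsymbol{\theta}_0\right) \tag{12.26}$$

or (Eq. (12.148))

$$\boldsymbol{\mu}_{\theta|y} = \boldsymbol{\theta}_0 + \Sigma_\theta \Phi^T \left(\Sigma_\eta + \Phi \Sigma_\theta \Phi^T\right)^{-1} \left(\boldsymbol{y} - \Phi \boldsymbol{\theta}_0\right). \tag{12.27}$$

Also, using Woodbury’s identity from Appendix A.1, we can readily see that

$$\Sigma_{\theta|y} = \Sigma_\theta - \Sigma_\theta \Phi^T \left(\Sigma_\eta + \Phi \Sigma_\theta \Phi^T\right)^{-1} \Phi \Sigma_\theta. \tag{12.28}$$

In practice, one uses the most computationally convenient form, depending on the dimensionality
of the involved matrices, so as to invert the one of lower dimension: (12.26) requires the inversion of a
$K \times K$ matrix, whereas (12.27) and (12.28) require the inversion of an $N \times N$ one.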
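The equivalence of these variants can be checked numerically. The sketch below is only a sanity check on randomly generated quantities; the sizes, the seed, and the helper `random_spd` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 3                                   # hypothetical numbers of observations and parameters
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)
theta0 = rng.normal(size=K)

def random_spd(n):
    """Random symmetric positive definite matrix (illustrative helper)."""
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

Sigma_theta = random_spd(K)
Sigma_eta = random_spd(N)
St_inv = np.linalg.inv(Sigma_theta)
Se_inv = np.linalg.inv(Sigma_eta)

# Route that inverts a K x K matrix: covariance (12.17) and mean (12.26).
Sigma_post = np.linalg.inv(St_inv + Phi.T @ Se_inv @ Phi)
mu_26 = Sigma_post @ (Phi.T @ Se_inv @ y + St_inv @ theta0)

# Route that inverts an N x N matrix: mean (12.27) and covariance (12.28).
G = Sigma_theta @ Phi.T @ np.linalg.inv(Sigma_eta + Phi @ Sigma_theta @ Phi.T)
mu_27 = theta0 + G @ (y - Phi @ theta0)
Sigma_28 = Sigma_theta - G @ Phi @ Sigma_theta

print(np.allclose(mu_26, mu_27), np.allclose(Sigma_post, Sigma_28))
```

When $K \ll N$ the first route is the cheaper one; when $N \ll K$, as can happen with a very large number of basis functions, the second route is preferable.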

Example 12.1. This example demonstrates the prediction task summarized in (12.22) and (12.23).
Data are generated based on the following nonlinear model:

$$y_n = \theta_0 + \theta_1 x_n + \theta_2 x_n^2 + \theta_3 x_n^3 + \theta_5 x_n^5 + \eta_n, \quad n = 1, 2, \ldots, N,$$

where $\eta_n$ are i.i.d. noise samples drawn from a zero mean Gaussian with variance $\sigma_\eta^2$. Samples $x_n$ are
equidistant points in the interval [0, 2]. The goal of the task is to predict the value $y$ given a measured
value $x$, using (12.22). The parameter values used to generate the data were equal to

$$\theta_0 = 0.2, \quad \theta_1 = -1, \quad \theta_2 = 0.9, \quad \theta_3 = 0.7, \quad \theta_5 = -0.2.$$

FIGURE 12.1

Each one of the red points $(y, x)$ indicates the prediction ($y$) corresponding to the input value ($x$). The error bars are
dictated by the computed variance, $\sigma_y^2$. The mean values used in the Gaussian prior are equal to the true values of
the unknown model. (A) $\sigma_\eta^2 = 0.05$, $N = 20$, $\sigma_\theta^2 = 0.1$. (B) $\sigma_\eta^2 = 0.05$, $N = 500$, $\sigma_\theta^2 = 0.1$. (C) $\sigma_\eta^2 = 0.15$, $N = 500$,
$\sigma_\theta^2 = 0.1$. Observe that the larger the data set is, the better the predictions are and the larger the noise variance is, the
larger the error bars become.


(a) In the first set of experiments, a Gaussian prior for the unknown $\boldsymbol{\theta}$ was used with mean $\boldsymbol{\theta}_0$ equal
to the previous true set of parameters and $\Sigma_\theta = 0.1 I$. Also, the true model structure was used
to construct the matrix $\Phi$. Fig. 12.1A shows the points $(y, x)$ in red together with the error
bars, as measured by the computed $\sigma_y^2$, for the case of $N = 20$ training points and $\sigma_\eta^2 = 0.05$.
Fig. 12.1B demonstrates the obtained improvement when the training points are increased to
$N = 500$, while keeping the values of the other two parameters unchanged. Fig. 12.1C corresponds
to the latter case, where the noise variance is increased to $\sigma_\eta^2 = 0.15$.

FIGURE 12.2

In this set of figures the mean values of the prior are different from that of the true model. (A) $\sigma_\eta^2 = 0.05$, $N = 20$,
$\sigma_\theta^2 = 0.1$. (B) $\sigma_\eta^2 = 0.05$, $N = 20$, $\sigma_\theta^2 = 2$; observe the effect of using larger variance for the prior. (C) $\sigma_\eta^2 = 0.05$,
$N = 500$, $\sigma_\theta^2 = 0.1$; observe the effect of the larger training data set. (D) The points correspond to a wrong model.


(b) In the second set of experiments, we kept the correct model, but the mean of the prior was given a
different value to that of the true model, namely,

$$\boldsymbol{\theta}_0 = [-10.54, 0.465, 0.0087, -0.093, -0.004]^T.$$

Fig. 12.2A corresponds to the case of $\sigma_\eta^2 = 0.05$, $N = 20$, and $\sigma_\theta^2 = 0.1$. Note the improvement
that is obtained when $\sigma_\theta^2$ is increased to 2, shown in Fig. 12.2B, while $N$ and $\sigma_\eta^2$ remain the same
as before; this is because the model takes into consideration our uncertainty about the prior mean
being away from the true value. Fig. 12.2C corresponds to $\sigma_\eta^2 = 0.05$, $N = 500$, and $\sigma_\theta^2 = 0.1$ and
shows the advantage of using a large number of training points.

(c) Fig. 12.2D corresponds to the case where the adopted model for prediction is the wrong one, that
is,

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \eta.$$

The used values were $\sigma_\eta^2 = 0.05$, $N = 500$, and $\sigma_\theta^2 = 2$. Observe that once a wrong model has been
adopted, one must not have “high expectations” for good prediction performance.
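The first set of experiments can be reproduced in outline with the sketch below. It is not the code behind the figures: the random seed, the nine test inputs, and the helper names `design` and `run` are assumptions, and only the predictive variances of panels (A), (B), and (C) of Fig. 12.1 are compared.

```python
import numpy as np

THETA_TRUE = np.array([0.2, -1.0, 0.9, 0.7, -0.2])   # parameter values given in the example
TRUE_POWERS = (0, 1, 2, 3, 5)                        # powers of x in the generating model

def design(x, powers):
    """Design matrix whose columns are x raised to the given powers."""
    return np.column_stack([x ** p for p in powers])

def run(N, sigma_eta2, sigma_theta2, theta0, powers=TRUE_POWERS, seed=0):
    """Generate data, compute the posterior, and return predictions with their variances."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0, N)                     # equidistant points in [0, 2]
    y = design(x, TRUE_POWERS) @ THETA_TRUE + rng.normal(0.0, np.sqrt(sigma_eta2), N)
    Phi = design(x, powers)
    K = Phi.shape[1]
    Sigma_post = np.linalg.inv(np.eye(K) / sigma_theta2 + Phi.T @ Phi / sigma_eta2)
    mu_post = theta0 + Sigma_post @ Phi.T @ (y - Phi @ theta0) / sigma_eta2
    x_test = np.linspace(0.0, 2.0, 9)                # hypothetical test inputs
    Phi_test = design(x_test, powers)
    mu_y = Phi_test @ mu_post                        # predictions, as in (12.22)
    # Row-wise quadratic forms phi^T(x) Sigma phi(x), as in (12.23).
    sigma_y2 = sigma_eta2 + np.einsum("ij,jk,ik->i", Phi_test, Sigma_post, Phi_test)
    return x_test, mu_y, sigma_y2

_, _, s_A = run(20, 0.05, 0.1, THETA_TRUE)           # setting of Fig. 12.1A
_, _, s_B = run(500, 0.05, 0.1, THETA_TRUE)          # setting of Fig. 12.1B
_, _, s_C = run(500, 0.15, 0.1, THETA_TRUE)          # setting of Fig. 12.1C
print(s_A.mean(), s_B.mean(), s_C.mean())
```

Passing a different `theta0`, or a shorter `powers` tuple such as `(0, 1, 2)` with a prior mean of matching length, gives a way to experiment with the settings of the second and third sets of experiments.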


12.3 THE EVIDENCE FUNCTION AND OCCAM’S RAZOR RULE


In the previous section, we made a comment about the importance of the marginal PDF $p(\boldsymbol{y})$. This
section is fully dedicated to this quantity. In the notation used in (12.14), we tacitly suppressed the
dependence on the adopted model. For example, the Gaussian assumption for the prior in (12.8) and for
the conditional in (12.9) should have been reflected in the marginal as $p(\boldsymbol{y}; \boldsymbol{\phi}, \Sigma_\eta, \Sigma_\theta)$, because different
Gaussians, different basis functions, and different orders $K$ of the model can be used. Furthermore,
non-Gaussian PDFs can also be adopted. In a more general setting, let us make the dependence on
the model explicit as $p(\boldsymbol{y}|\mathcal{M}_i)$. Assuming the choice of a model to be random, mobilizing the Bayes
theorem once more, we have

$$P(\mathcal{M}_i|\boldsymbol{y}) = \frac{P(\mathcal{M}_i)\, p(\boldsymbol{y}|\mathcal{M}_i)}{p(\boldsymbol{y})}, \tag{12.29}$$

where

$$p(\boldsymbol{y}) = \sum_i P(\mathcal{M}_i)\, p(\boldsymbol{y}|\mathcal{M}_i), \tag{12.30}$$


and $P(\mathcal{M}_i)$ is the prior probability of $\mathcal{M}_i$; $P(\mathcal{M}_i)$ provides a measure of the subjective prior over
all possible models, which expresses our guess on how plausible a model is with respect to alternative
ones, prior to the data arrival. Because the denominator in (12.29) is independent of the model, one
can obtain the most probable model, after observing $\boldsymbol{y}$, by maximizing the numerator. If one assigns to
all possible models equal probabilities, then detecting the most probable model under the given set of
observations becomes a task of maximizing $p(\boldsymbol{y}|\mathcal{M}_i)$ with respect to the model, $\mathcal{M}_i$. This is the reason
that this PDF is known as the evidence function for the model or simply as the evidence. In practice, we
content ourselves with using the most probable model, although an orthodox Bayesian would suggest
averaging all obtained quantities over all possible models, as in (12.30). In an ideal Bayesian setting,
one does not choose among models; predictions are performed by summing over all possible models,
each one weighted by the respective probability. However, in many practical problems we may have
reasons to suggest that the evidence function is strongly peaked around a specific model; after all, such
an assumption may simplify the task considerably.
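As a small illustration of (12.29) and (12.30), the sketch below turns log-evidence values into posterior model probabilities. Working in the log domain is an implementation choice made here to avoid numerical underflow; the function name `model_posterior` and the three log-evidence values are hypothetical.

```python
import numpy as np

def model_posterior(log_evidence, prior=None):
    """P(M_i | y) from the values ln p(y | M_i) and the priors P(M_i), as in (12.29)-(12.30)."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if prior is None:                            # equal prior probabilities for all models
        prior = np.full(log_evidence.shape, 1.0 / log_evidence.size)
    log_joint = np.log(prior) + log_evidence     # ln of the numerator in (12.29)
    log_joint -= log_joint.max()                 # shift for numerical stability
    w = np.exp(log_joint)
    return w / w.sum()                           # the shift cancels in this normalization

# Hypothetical log-evidence values for three candidate models.
post = model_posterior([-1050.0, -1047.0, -1049.0])
print(post, post.argmax())
```

With equal priors the most probable model is simply the one with the largest evidence, which is the model selection rule described above; the full vector `post` is what a model-averaging approach would use as weights.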

From a mathematical formulation’s point of view, each model is expressed in terms of a set of
nonrandom (deterministic) parameters. To get the model that maximizes the evidence is equivalent to
maximizing with respect to these parameters. For example, let us assume that in the regression task
we adopt a Gaussian distribution for the noise and a Gaussian prior for the regression task random
parameters, $\boldsymbol{\theta}$. Then this pair of Gaussians comprises our model and the parameters that describe it are
the respective covariance matrices for the two Gaussians as well as the mean value of the prior. To
distinguish from the random parameters, $\boldsymbol{\theta}$, which describe the original learning task (e.g., regression),
these additional parameters are often called hyperparameters. Later on, in Chapter 13, we will see that
the hyperparameters can also be treated as random variables.

FIGURE 12.3

The posterior peaks around the value $\hat{\theta}_{\mathrm{MAP}}$ and the posterior PDF can be approximated by $p(\hat{\theta}_{\mathrm{MAP}}|\boldsymbol{y}; \mathcal{M}_i)$ over an
interval of values equal to $\Delta\theta_{\theta|y}$.

We now turn our attention to what is hidden behind the optimization of $p(\boldsymbol{y}|\mathcal{M}_i)$ with respect
to different models. Before we proceed, it is worth making a comment. A superficial first look may
lead one to wonder whether this is any different from maximizing the likelihood $p(\boldsymbol{y}; \boldsymbol{\theta})$, as done in
Section 12.2.1. As a matter of fact, the two cases belong to two different worlds. ML maximizes with
respect to a single (vector) parameter within an adopted model, and this is the weak point that makes
ML vulnerable to overfitting. Maximizing the evidence is an optimization task with respect to the
model itself, a wise alternative that guards us against overfitting, as we explain next.

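From (12.14) we have

$$p(\boldsymbol{y}|\mathcal{M}_i) = \int p(\boldsymbol{y}|\mathcal{M}_i, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{M}_i)\, d\boldsymbol{\theta}: \quad \text{evidence function}. \tag{12.31}$$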


Let us assume for simplicity that $\theta$ is a scalar, $\theta \in \mathbb{R}$, and that the integrand in (12.31), which according
to the Bayes theorem is proportional to the posterior $p(\theta|\boldsymbol{y}; \mathcal{M}_i)$, peaks around a value; this is obviously
the value that would result as the MAP estimate, $\hat{\theta}_{\mathrm{MAP}}$. Fig. 12.3 illustrates the respective graphs. Thus,
(12.31) can be approximated by

$$p(\boldsymbol{y}|\mathcal{M}_i) \approx p(\boldsymbol{y}|\mathcal{M}_i, \hat{\theta}_{\mathrm{MAP}})\, p(\hat{\theta}_{\mathrm{MAP}}|\mathcal{M}_i)\, \Delta\theta_{\theta|y}. \tag{12.32}$$

To get a better feeling for each one of the factors involved in (12.32), let us also assume that the prior
PDF is (almost) uniform with a width equal to $\Delta\theta$. Then (12.32) is rewritten as

$$p(\boldsymbol{y}|\mathcal{M}_i) \approx p(\boldsymbol{y}|\mathcal{M}_i, \hat{\theta}_{\mathrm{MAP}})\, \frac{\Delta\theta_{\theta|y}}{\Delta\theta}. \tag{12.33}$$

The first factor in the product on the right-hand side in (12.33) coincides with the likelihood function
at its optimal value, because for this case of uniform prior, $\hat{\theta}_{\mathrm{MAP}} = \hat{\theta}_{\mathrm{ML}}$. In other words, this factor
provides us with the best fit that model $\mathcal{M}_i$ can achieve on the given set of observations. However,
now, in contrast to the ML method, the evidence function also depends on the second factor, $\Delta\theta_{\theta|y}/\Delta\theta$. As
it has been pointed out in the insightful papers [13,27,28], this term accounts for the complexity of the
model, and it is named the Occam factor for obvious reasons. Let us elaborate on this a bit more by
following the reasoning given in [28].

The Occam factor penalizes those models which are finely tuned to the received observations. As an
example, if two different models $\mathcal{M}_i$ and $\mathcal{M}_j$ have a similar range of values for their prior PDFs, then
if, say, $\Delta\theta_{\theta|y}(\mathcal{M}_i) \ll \Delta\theta_{\theta|y}(\mathcal{M}_j)$, then $\mathcal{M}_i$ will be penalized more; only a small range of values for
$\theta$ survive (i.e., correspond to high probability values) after the reception of $\boldsymbol{y}$. So, if this fine-tuned (to
the data) model $\mathcal{M}_i$ resulted in a large value of the ML term, it would not be certain that the evidence
would be maximized for it, because the Occam factor would be small. Which model, between the two,
finally wins depends on the product of the two involved terms. Soon we will see that the Occam term
is also related to the number of parameters; that is, to the complexity of the adopted model.


LAPLACIAN APPROXIMATION AND THE EVIDENCE FUNCTION 

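To investigate the evidence function for the general multiparameter case, we will employ the method of
Laplacian approximation of a PDF. This is a general methodology that approximates any PDF locally
in terms of a Gaussian one. To this end, define

$$g(\boldsymbol{\theta}) = \ln\left(p(\boldsymbol{y}|\mathcal{M}_i, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{M}_i)\right). \tag{12.34}$$

Use Taylor’s expansion around $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ and keep terms up to the second order,

$$\begin{aligned}
g(\boldsymbol{\theta}) &= g(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}) + (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})^T \left.\frac{\partial g(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{\mathrm{MAP}}} + \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})^T \left.\frac{\partial^2 g(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^2}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{\mathrm{MAP}}} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}}) \\
&= g(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}) - \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})^T \Sigma^{-1} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}}),
\end{aligned} \tag{12.35}$$

where the first-order term vanishes because the gradient of $g$ is zero at its maximum, $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}$, and

$$\Sigma^{-1} := -\left.\frac{\partial^2 g(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^2}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{\mathrm{MAP}}}.$$

(Similarly, to obtain the Laplacian approximation of a general PDF, $p(\boldsymbol{x})$, we set $g(\boldsymbol{x}) = \ln p(\boldsymbol{x})$.)
The expansion in (12.35) leads to the approximation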


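$$p(\boldsymbol{y}|\mathcal{M}_i, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{M}_i) \approx p(\boldsymbol{y}|\mathcal{M}_i, \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})\, p(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}|\mathcal{M}_i) \exp\left(-\frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})^T \Sigma^{-1} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})\right). \tag{12.36}$$

Plugging (12.36) into the integral of (12.31) we obtain

$$p(\boldsymbol{y}|\mathcal{M}_i) \approx p(\boldsymbol{y}|\mathcal{M}_i, \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})\, p(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}|\mathcal{M}_i)\, (2\pi)^{K/2} |\Sigma|^{1/2}, \tag{12.37}$$

and taking the logarithms we have

$$\underbrace{\ln p(\boldsymbol{y}|\mathcal{M}_i)}_{\text{Evidence}} \approx \underbrace{\ln p(\boldsymbol{y}|\mathcal{M}_i, \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})}_{\text{Best likelihood fit}} + \underbrace{\ln p(\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}|\mathcal{M}_i) + \frac{K}{2} \ln(2\pi) + \frac{1}{2} \ln|\Sigma|}_{\text{Occam factor}}. \tag{12.38}$$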


The direct dependence of the Occam term on the complexity (number of basis functions) of the adopted
model is now readily spotted. Moreover, the complexity-related Occam term depends on the prior PDF
and the second derivatives (via $\Sigma$) of the posterior PDF, too; that is, it depends on how “sharp” the shape
of the latter is in the $K$-dimensional space. In other words, the covariance term provides the “error
bar” information. Moreover, its determinant does depend on $K$, too. That is, the dependence on the
complexity term $K$ is more involved than what a naive look at Eq. (12.38) suggests (see Remarks 12.2,
concerning the BIC criterion). Hence, in a single equation, besides the number of parameters and
the associated best-fit term, the evidence also takes into account information related to the associated
variance; maximizing the evidence leads to the best tradeoff. Fig. 12.4 illustrates the essence behind
the evidence maximization for model selection. If the model is too complex, it can fit well a wide range
of data sets, and because $p(\boldsymbol{y}|\mathcal{M}_i)$ has to integrate to one, its value for any value of $\boldsymbol{y}$ is expected to
be low. The opposite is true for models that are too simple; such models can model well some data sets
but not a wide range of them, and consequently, the evidence function peaks sharply around a value
in the space of observation sets. Thus, selecting a data set at random, it is rather unlikely that this has
been generated by such a model. Having said that, it is important to emphasize, once more, that the Occam
term does not depend solely on the number of parameters; hence, complexity here should be interpreted
in a more “open-minded” way. This robustness against overfitting, which is intrinsic in the Bayesian
inference approach, is the consequence of integrating out the parameters for any specific model in (12.31);
this integration penalizes models of high complexity because such models can model a large range of
data.

FIGURE 12.4

[Curves of $p(\boldsymbol{y}|\mathcal{M}_i)$ labeled “Too simple”, “Best tradeoff”, and “Too complex”; horizontal axis: space of the observation sets.]
Too simple models can explain well a very small range of data. On the other hand, too complex models can explain
a wide range of data; however, they do not provide any confidence because they assign low probability to all data
sets. For the observation set, $\boldsymbol{y}$, the evidence is maximized for the model with intermediate complexity.
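For a concrete feel of the terms in (12.38), the following sketch evaluates them for a linear model with Gaussian noise and an isotropic Gaussian prior. In that special case $g(\boldsymbol{\theta})$ is exactly quadratic, so the Laplacian approximation should reproduce the exact log-evidence; the comparison is included as a sanity check. The function names, the toy data, and the variance values are assumptions made for illustration.

```python
import numpy as np

def log_gauss(v, mean, cov):
    """ln N(v | mean, cov) for a full covariance matrix."""
    d = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d.size * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def laplace_log_evidence(Phi, y, theta0, sigma_eta2, sigma_theta2):
    """Right-hand side of (12.38) for Gaussian noise and an isotropic Gaussian prior."""
    N, K = Phi.shape
    # Sigma^{-1} is minus the Hessian of g(theta); for this model it is constant in theta.
    Sigma = np.linalg.inv(np.eye(K) / sigma_theta2 + Phi.T @ Phi / sigma_eta2)
    theta_map = theta0 + Sigma @ Phi.T @ (y - Phi @ theta0) / sigma_eta2
    best_fit = log_gauss(y, Phi @ theta_map, sigma_eta2 * np.eye(N))
    log_prior = log_gauss(theta_map, theta0, sigma_theta2 * np.eye(K))
    _, logdet = np.linalg.slogdet(Sigma)
    occam = log_prior + 0.5 * K * np.log(2 * np.pi) + 0.5 * logdet
    return best_fit + occam, best_fit, occam

def exact_log_evidence(Phi, y, theta0, sigma_eta2, sigma_theta2):
    """ln p(y) obtained by integrating theta out of the linear Gaussian model."""
    N = Phi.shape[0]
    cov = sigma_eta2 * np.eye(N) + sigma_theta2 * Phi @ Phi.T
    return log_gauss(y, Phi @ theta0, cov)

# Hypothetical data: a quadratic observed in noise, modeled with the basis [1, x, x^2].
rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 25)
Phi = np.vander(x, 3, increasing=True)
y = Phi @ np.array([0.2, -1.0, 0.9]) + rng.normal(0.0, np.sqrt(0.05), x.size)
theta0 = np.zeros(3)

approx, best_fit, occam = laplace_log_evidence(Phi, y, theta0, 0.05, 1.0)
exact = exact_log_evidence(Phi, y, theta0, 0.05, 1.0)
print(approx, exact, occam)
```

In this setting the Occam term is never positive, since the posterior covariance cannot exceed the prior one; it therefore acts as a penalty subtracted from the best-fit term. For non-Gaussian models the two quantities compared here would in general differ.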

Historically, the Occam’s razor rule in its Bayesian interpretation was first demonstrated in [13] and
later on in [27,43], although the foundations go back to the pioneering work of Sir Harold Jeffreys in
the 1930s [23]. Two insightful reviews on the Bayesian inference approach that are well worth reading
are given in [22,26].

Returning to (12.14) and assuming for simplicity that

$$p(\boldsymbol{y}|\boldsymbol{\theta}) = \mathcal{N}\left(\boldsymbol{y}\,|\,\Phi\boldsymbol{\theta}, \sigma_\eta^2 I\right) \quad \text{and} \quad p(\boldsymbol{\theta}) = \mathcal{N}\left(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}_0, \sigma_\theta^2 I\right), \tag{12.39}$$

we can express the evidence as $p(\boldsymbol{y}|\sigma_\eta^2, \sigma_\theta^2)$, which, for this case, turns out to be Gaussian (appendix
of this chapter); thus, it is available in closed form. Hence for this specific case, the model space is
described via the hyperparameters $\sigma_\eta^2$, $\sigma_\theta^2$ and maximization of the evidence with respect to these (unknown)
model parameters can take place iteratively, for the given set of observations stacked in $\boldsymbol{y}$ (see,
for example, [27]). However, in general we do not have the ability to express the evidence in closed
form. The EM algorithm, which is described in Section 12.4, is a popular and powerful tool
to this end. One could also resort to the Laplacian approximation to approximate the involved PDFs
as Gaussians, but this approximation turns out not always to be a good choice; furthermore, in high-dimensional
parameter spaces, the computations of the second-order derivatives and the determinant
can become burdensome [3].

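Finally, let us make a comment concerning the Laplacian approximation. In the discussion
above, our goal was to get an approximation of the integral (normalizing constant/evidence) of
$p(\boldsymbol{y}|\mathcal{M}_i, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{M}_i)$. However, if our interest were to approximate the PDF itself, we should be careful
in selecting the normalizing constant, which by the nature of the Gaussian function leads to

$$p(\boldsymbol{y}|\mathcal{M}_i, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{M}_i) \approx \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})^T \Sigma^{-1} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{MAP}})\right).$$

This is also the case for the Laplacian approximation for any PDF $p(\cdot)$.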

Remarks 12.2. 

• In the Bayesian approach, one makes all the modeling assumptions explicit, and it is then left to the
rules of probability theory to provide the answers. One does not have to “worry” about the choice
of an optimizing criterion, where different criteria lead to different estimators, and there is not an
objective, systematic way to decide which criterion is best. On the other hand, in the Bayesian
approach one has to make sure to select the prior that explains the data in the best possible way.

• The choice of the prior PDF is very critical in the performance of Bayesian methods and must be
carried out in such a way as to encapsulate prior knowledge as fully as possible. In practice, different
alternatives can be adopted [3].

- Subjective priors. According to this path, we choose the prior $p(\boldsymbol{\theta})$ to make the manipulation of
the integration tractable, by employing conjugate priors (Section 3.11.1) within the exponential
family of PDFs. This family of PDFs will be presented in Section 12.8.

- Hierarchical priors. Each one of the components $\theta_k$, $k = 0, 1, \ldots, K - 1$, of $\boldsymbol{\theta}$, is controlled
by a different parameter; for example, all $\theta_k$s may be assumed to be independent Gaussian variables,
each one with a different variance. In turn, variances are considered to be random variables
that follow a statistical distribution controlled by another set of deterministic (not random)
hyperparameters; thus, a hierarchy of priors is adopted. As we will see later on, hierarchical priors
are often designed using conjugate pairs of PDFs.

- Noninformative or objective priors. The choice of the prior is done in such a way as to embed
as little extra information as possible and to exploit knowledge that is conveyed only by the
available data. One way to construct such priors is to resort to information theoretic arguments.
For example, one can estimate $p(\boldsymbol{\theta})$ by minimizing its Kullback-Leibler (KL) divergence from
$p(\boldsymbol{\theta}|\boldsymbol{y})$.

• The fact that the Bayesian approach allows the recovery of all the desired information from a single
data set does not suggest that the method is “cross-validation-free”. Maximizing the evidence, which
at the same time guards against overfitting, does not necessarily mean that the performance of the
designed estimator is optimized. This is more true in practice where, as we are going to see very
soon, most often a bound of the evidence is optimized instead, to bypass computational obstacles.
As is always the case in life, the proof of the cake is in the eating. Thus, the final verdict should only
come from the generalization ability of the designed estimator, that is, its ability to make reliable
predictions using previously unseen data. Moreover, there is no reason to suggest that the evidence may
be a reliable predictor of the generalization performance. This has been known and explicitly stated
since the method’s infancy stage (see, for example, [27]). The generalization performance depends
heavily on whether the adopted prior matches the “true” distribution of the unknown parameters.
This is nicely demonstrated with a toy example in [17]. It is shown that the Bayesian average is
optimal only if the adopted prior coincides with the true one. The situation is less clear when this is
not the case. A more theoretical treatment of the topic, when there is a mismatch between the true
and the selected prior, can be found in [18]. Thus, to be able to assess the generalization performance
of a model learned via Bayesian inference, cross-validation is required, unless an independent test
set can be afforded (for example, [44]).

To avoid the need for cross-validation, an alternative way has been adopted by a number of
authors. The cost function, to be minimized in (12.13), is built to quantify the generalization performance
of an estimator; optimization then takes place concurrently for the unknown weights as well
as the regularization parameter (see, for example, [15,32]). In general, this leads to a nonconvex
optimization task and such techniques have not (yet) been widely embraced by the machine learning
community.

• The Laplacian approximation to the evidence function is closely related to the Bayesian information
criterion (BIC) [39] for model selection, which is expressed as

$$\ln p(\boldsymbol{y}|\mathcal{M}_i) \approx \ln p(\boldsymbol{y}|\mathcal{M}_i, \hat{\boldsymbol{\theta}}_{\mathrm{MAP}}) - \frac{1}{2} K \ln N.$$

The BIC is obtained as a large $N$ approximation to (12.38), assuming a broad enough Gaussian
prior, and manipulating a bit the determinant involved in the last term. For a discussion including
other related criteria, see [3,41].
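A hedged sketch of how the BIC expression might be used for choosing a polynomial order follows. Two simplifications are assumed here and are not part of the text: the noise variance is taken as known, and the MAP estimate is replaced by the least-squares one, which is what a very broad prior would give. The data, the seed, and the function names are hypothetical.

```python
import numpy as np

def bic_log_evidence(max_log_lik, K, N):
    """BIC approximation of ln p(y | M_i): best log-likelihood minus (K/2) ln N."""
    return max_log_lik - 0.5 * K * np.log(N)

def max_gauss_log_lik(Phi, y, sigma_eta2):
    """Gaussian log-likelihood at the least-squares fit (noise variance assumed known)."""
    theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    r = y - Phi @ theta_ls
    return -0.5 * (y.size * np.log(2 * np.pi * sigma_eta2) + r @ r / sigma_eta2)

# Hypothetical data generated by a quadratic; candidate polynomial models of order 0 to 6.
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 200)
y = 0.5 - x + 2.0 * x**2 + rng.normal(0.0, 0.1, x.size)

log_liks, scores = [], []
for order in range(7):
    Phi = np.vander(x, order + 1, increasing=True)   # K = order + 1 parameters
    ll = max_gauss_log_lik(Phi, y, sigma_eta2=0.01)
    log_liks.append(ll)
    scores.append(bic_log_evidence(ll, order + 1, x.size))

best_order = int(np.argmax(scores))
print(best_order)
```

The best-fit term alone can only improve as the order grows, because the models are nested; the $\frac{1}{2} K \ln N$ term is what stops the selected order from growing without limit.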

• The Bayesian framework is also closely related to the minimum description length (MDL) methods.
The log-evidence is associated with the number of bits in the shortest message that encodes the data
via model $\mathcal{M}_i$ (for example, [45]).

• Type II maximum likelihood: Note that the evidence is the marginal likelihood function after integrating
out the parameters $\boldsymbol{\theta}$. To distinguish it from the MAP method, when the evidence function
is maximized with respect to a set of unknown parameters, it is usually referred to as generalized
maximum likelihood or Type II maximum likelihood and sometimes as empirical Bayes. Recall from
Remarks 12.1 that the MAP is named Type I estimator.


12.4 LATENT VARIABLES AND THE EM ALGORITHM 

At the end of Section 12.3, it was pointed out that if we assume that $p(\boldsymbol{y}|\boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$ are Gaussians of
the form given in (12.39), then the evidence function associated with the regression task in Eq. (12.3) is
also Gaussian, parameterized via the (hyper)parameters $\sigma_\eta^2$, $\sigma_\theta^2$. Let us denote this set of unknown nonrandom
parameters as $\boldsymbol{\xi} = [\sigma_\eta^2, \sigma_\theta^2]^T$, and we can write $p(\boldsymbol{y}; \boldsymbol{\xi})$. Maximizing the evidence with respect
to $\boldsymbol{\xi}$ becomes a typical maximum likelihood task. However, in general, such closed form expressions
for the evidence function are not possible, and the integration in (12.14) is intractable. The main source
of difficulty is the fact that our regression model is described via two sets of random variables, that is, $\boldsymbol{y}$
and $\boldsymbol{\theta}$, yet only one of them, $\boldsymbol{y}$, can be directly observed. The other one, $\boldsymbol{\theta}$, cannot be observed, and this
is the reason that the Bayesian philosophy tries to integrate it out of the joint PDF $p(\boldsymbol{y}, \boldsymbol{\theta})$. If $\boldsymbol{\theta}$ could
be observed, then the unknown set of parameters $\boldsymbol{\xi}$ could be obtained by maximizing the likelihood
$p(\boldsymbol{y}, \boldsymbol{\theta}; \boldsymbol{\xi})$, given a set of (joint) observations of $(\boldsymbol{y}, \boldsymbol{\theta})$. Because they cannot be observed, the random
variables in the vector $\boldsymbol{\theta}$ are known as hidden variables.

Although we introduced the notion of hidden variables via our familiar regression task, unobserved 
variables (besides the noise) occur very often in a number of problems in probability and statistics. In 
a number of cases, from a larger set of jointly distributed random variables, only some can be observed 
and the rest remain hidden. Moreover, it is often useful to build hidden variables into a model by 
design. These variables are meant to represent latent causes that influence the observed variables and 
their introduction may facilitate the analysis. Often, such models associate one extra variable with each
one of the observations. We will refer to such unobserved variables as latent variables. Their difference
from the hidden ones is that their number is equal to that of the observations and grows accordingly as
more observations become available. In contrast, unobserved random variables that are associated with the
model and not with each one of the observations, individually, will be referred to as hidden variables.

12.4.1 THE EXPECTATION-MAXIMIZATION ALGORITHM 

The expectation-maximization (EM) algorithm is an elegant algorithmic tool to maximize the likelihood
(evidence) function for problems with latent/hidden variables. We will state the problem in a
general formulation, and then we will apply it to different tasks, including regression.

Let $\boldsymbol{x}$ be a random vector and let $\mathcal{X}$ be the respective set of observations. Let $\mathcal{X}^l := \{\boldsymbol{x}_1^l, \ldots, \boldsymbol{x}_N^l\}$ be
the corresponding set of latent variables; these can be either of a discrete or of a continuous nature. Each
observation in $\mathcal{X}$ is associated with a latent vector $\boldsymbol{x}^l$ in $\mathcal{X}^l$. These latent variables are also known as
local ones and each one expresses the hidden structure associated with the corresponding observation.
We will refer to the set $\{\mathcal{X}, \mathcal{X}^l\}$ as the complete data set and to the set of observations, $\mathcal{X}$, as the
incomplete one. Hidden random parameters, $\boldsymbol{\theta}$, can also be dealt with as latent variables; however, their
number remains fixed (independent of $N$) and they are also known as global variables. In such cases,
the complete data set is $\{\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}\}$. To unclutter notation, we will focus on the set $\mathcal{X}^l$; yet, everything
to be said also applies to hidden/global as well as to a combination of local/latent and global/hidden
variables. Furthermore, let the corresponding joint distribution be parameterized in terms of a set of
unknown nonrandom (hyper)parameters, $\boldsymbol{\xi}$. We further assume that, although $\mathcal{X}^l$ cannot be observed,
the posterior distribution $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$ ($P(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$ for the discrete case) is fully specified, given the
values in $\boldsymbol{\xi}$ and the observations in $\mathcal{X}$. This is a critical assumption for the EM algorithm. If the posterior
PDF is not known, then one has to resort to variants of the EM which attempt to approximate it. We
will come to such schemes in Section 13.2.

If the complete log-likelihood $\ln p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})$ were available, then the problem would be a typical
maximum likelihood one. However, because no observations for the latent variables are available, the
EM algorithm considers the expectation of the complete log-likelihood with respect to the latent variables
associated with $\mathcal{X}^l$; this operation is possible, because the posterior distribution $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$ is
assumed to be known, provided that $\boldsymbol{\xi}$ is known. It can be shown that maximizing this expectation is
equivalent to maximizing the corresponding evidence function $p(\mathcal{X}; \boldsymbol{\xi})$ (see Problem 12.3 and Section 12.7).
To this end, the EM algorithm builds on an iterative philosophy, initialized by an arbitrary
value $\boldsymbol{\xi}^{(0)}$. Then it proceeds along the following steps.

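The EM algorithm

1. Expectation E-step: At the $(j+1)$th iteration, compute $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)})$ and

$$Q(\boldsymbol{\xi}, \boldsymbol{\xi}^{(j)}) = \mathbb{E}\left[\ln p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})\right], \tag{12.40}$$

where the expectation is taken with respect to $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)})$.

2. Maximization M-step: Determine $\boldsymbol{\xi}^{(j+1)}$ so that

$$\boldsymbol{\xi}^{(j+1)} = \arg\max_{\boldsymbol{\xi}} Q(\boldsymbol{\xi}, \boldsymbol{\xi}^{(j)}). \tag{12.41}$$

3. Check for convergence according to a criterion. If it is not satisfied go back to step 1.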

A possible convergence criterion is to check whether $\|\boldsymbol{\xi}^{(j+1)} - \boldsymbol{\xi}^{(j)}\| < \epsilon$, for some user-defined constant
$\epsilon$. The use of the EM algorithm presupposes that working with the joint PDF $p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})$ is
computationally tractable. This is, for example, the case when working within the exponential family
of PDFs, where the E-step may require only the computation of a few statistics of the latent variables.
The exponential family of distributions is a computationally convenient one and it is treated in more
detail in Section 12.8.
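The three steps and the convergence criterion can be sketched as a generic loop. To make the loop runnable, a toy latent variable model is assumed here that is not discussed in the text above: a mixture of two unit-variance Gaussians, with one discrete latent label per observation and $\boldsymbol{\xi} = [P, m_1, m_2]^T$. All names, the data, and the initial point are hypothetical.

```python
import numpy as np

def em(x, xi0, e_step, m_step, eps=1e-6, max_iter=500):
    """Generic EM loop: alternate E- and M-steps until ||xi_new - xi|| < eps."""
    xi = np.asarray(xi0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(x, xi)          # posterior of the latent variables under xi^(j)
        xi_new = m_step(x, stats)      # maximizer of Q(xi, xi^(j))
        if np.linalg.norm(xi_new - xi) < eps:
            return xi_new
        xi = xi_new
    return xi

# Toy model (an assumption): x_n ~ P N(m1, 1) + (1 - P) N(m2, 1), with xi = [P, m1, m2].
def e_step(x, xi):
    P, m1, m2 = xi
    a = P * np.exp(-0.5 * (x - m1) ** 2)
    b = (1.0 - P) * np.exp(-0.5 * (x - m2) ** 2)
    return a / (a + b)                 # probability that the latent label of x_n is 1

def m_step(x, g):
    # Closed-form maximizers of the expected complete log-likelihood for this toy model.
    return np.array([g.mean(), (g @ x) / g.sum(), ((1 - g) @ x) / (1 - g).sum()])

def log_lik(x, xi):
    """Incomplete-data log-likelihood ln p(X; xi) of the toy model."""
    P, m1, m2 = xi
    a = P * np.exp(-0.5 * (x - m1) ** 2)
    b = (1.0 - P) * np.exp(-0.5 * (x - m2) ** 2)
    return np.sum(np.log((a + b) / np.sqrt(2 * np.pi)))

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
xi_start = np.array([0.5, -1.0, 1.0])
xi_hat = em(x, xi_start, e_step, m_step)
print(xi_hat, log_lik(x, xi_start), log_lik(x, xi_hat))
```

Keeping `e_step` and `m_step` as arguments mirrors the general formulation: only these two functions change from one model to another, while the loop and the stopping rule stay the same.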
Remarks 12.3.

• The EM algorithm was proposed and given its name in the seminal 1977 paper by Arthur Dempster,
Nan Laird, and Donald Rubin [12]. The paper generalized previously published results, as for example
[2,38], and had a significant impact as a powerful tool in statistics. The complete convergence
proof was given in [47]. See, for example, [29] for a related discussion.

• It can be shown that the EM algorithm converges to an (in general, local) maximum of $p(\mathcal{X}; \boldsymbol{\xi})$,
which was our original goal. The likelihood never decreases. The convergence is slower than
the quadratic convergence of Newton-type searching techniques, although near an optimal point
a speedup may be possible. However, the convergence of the algorithm is smooth and its complexity
is more attractive than that of Newton-type schemes, with no matrix inversions involved. The keen reader
may obtain more information in, for example, [14,30,33,43].

• The EM algorithm can be modified to obtain the MAP estimate. To this end, the M-step is changed to
(Problem 12.4)

$$\boldsymbol{\xi}^{(j+1)} = \arg\max_{\boldsymbol{\xi}} \left\{ Q(\boldsymbol{\xi}, \boldsymbol{\xi}^{(j)}) + \ln p(\boldsymbol{\xi}) \right\}, \tag{12.42}$$

where $p(\boldsymbol{\xi})$ is the prior PDF associated with $\boldsymbol{\xi}$, if it is considered to be a random vector.

• The EM algorithm can be sensitive to the choice of the initial point $\boldsymbol{\xi}^{(0)}$. In practice, one can run the
algorithm a number of times, starting from different initial points, and keep the best of the results.
Other initialization procedures have also been used, depending on the application.

• Missing data: The EM algorithm can also be used to cope with cases where some of the values
from the observed training data are missing. Missing values can be treated as hidden variables and
maximization of the likelihood can be done by marginalizing over them. Such a procedure makes
sense only if data are missing at random, that is, the cause of missing data is a random event and
does not depend on the values of the unobserved samples.


12.5 LINEAR REGRESSION AND THE EM ALGORITHM 

The Bayesian viewpoint on the regression task was considered in Section 12.2.3 via the Gaussian model
assumption for $p(\boldsymbol{y}|\boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$, given in (12.9) and
(12.8), which subsequently led to a Gaussian posterior for $p(\boldsymbol{\theta}|\boldsymbol{y})$, given in
(12.16). In the current section, and for the sake of presentation simplicity, we will adopt the special
case of diagonal covariance matrices, that is, $\Sigma_\eta = \sigma_\eta^2 I$,
$\Sigma_\theta = \sigma_\theta^2 I$, and $\boldsymbol{\theta}_0 = \boldsymbol{0}$.

Our goal now becomes to consider $\sigma_\eta^2$ and $\sigma_\theta^2$ as (nonrandom) parameters and to
obtain their values by maximizing the corresponding evidence function in (12.15). To this end, we will
use the EM algorithm. Following the notation that we have adopted so far for the regression task, the
observed variables are the outputs, $\boldsymbol{y}$, and the unobserved ones comprise the random
parameters, $\boldsymbol{\theta}$, that define the regression model. Hence, in the current context,
$\boldsymbol{y}$ will replace $\mathcal{X}$ and $\boldsymbol{\theta}$ will take the place of
$\mathcal{X}^l$ in the general formulation of the EM algorithm in Section 12.4.1.

A prerequisite in order to apply the EM procedure is the knowledge of the posterior, which for this
case is known, given the values of the parameters. We will work with the precision variables, and the
parameter vector becomes

$$\boldsymbol{\xi} = [\alpha, \beta]^T, \quad \alpha = \frac{1}{\sigma_\theta^2} \quad \text{and} \quad \beta = \frac{1}{\sigma_\eta^2}.$$

The EM algorithm is initialized with some arbitrary positive values $\alpha^{(0)}$ and $\beta^{(0)}$.
Then the algorithm at the $(j+1)$th iteration step, where $\alpha^{(j)}$ and $\beta^{(j)}$ are assumed
known, proceeds as follows:

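• E-Step: Compute the posterior $p(\boldsymbol{\theta}|\boldsymbol{y}; \boldsymbol{\xi}^{(j)})$, which
according to (12.16) and for $\boldsymbol{\theta}_0 = \boldsymbol{0}$ is fully specified if we compute its
mean and covariance matrix, using (12.19) and (12.20), that is,

$$\Sigma_{\theta|y}^{(j)} = \left(\alpha^{(j)} I + \beta^{(j)} \Phi^T \Phi\right)^{-1}, \tag{12.43}$$

$$\boldsymbol{\mu}_{\theta|y}^{(j)} = \beta^{(j)} \Sigma_{\theta|y}^{(j)} \Phi^T \boldsymbol{y}. \tag{12.44}$$

Compute the expected value of the log-likelihood associated with the complete data set; this is given
by

$$\ln p(\boldsymbol{y}, \boldsymbol{\theta}; \boldsymbol{\xi}) := \ln p(\boldsymbol{y}, \boldsymbol{\theta}; \alpha, \beta) = \ln \big( p(\boldsymbol{y}|\boldsymbol{\theta}; \beta)\, p(\boldsymbol{\theta}; \alpha) \big),$$

or

$$\ln p(\boldsymbol{y}, \boldsymbol{\theta}; \alpha, \beta) = \frac{N}{2} \ln \beta + \frac{K}{2} \ln \alpha - \frac{\beta}{2} \|\boldsymbol{y} - \Phi\boldsymbol{\theta}\|^2 - \frac{\alpha}{2} \boldsymbol{\theta}^T \boldsymbol{\theta} - \left(\frac{N}{2} + \frac{K}{2}\right) \ln(2\pi). \tag{12.45}$$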


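Treating the hidden parameters as random variables, the expected value of (12.45), with respect to
$\boldsymbol{\theta}$, is carried out via the Gaussian posterior defined by (12.43) and (12.44). To this
end, the following steps are adopted.

1. To compute $E[\boldsymbol{\theta}^T \boldsymbol{\theta}]$, recall the definition of the respective
covariance matrix,

$$\Sigma_{\theta|y}^{(j)} = E\left[\big(\boldsymbol{\theta} - \boldsymbol{\mu}_{\theta|y}^{(j)}\big)\big(\boldsymbol{\theta} - \boldsymbol{\mu}_{\theta|y}^{(j)}\big)^T\right], \tag{12.46}$$

or

$$E\left[\boldsymbol{\theta}\boldsymbol{\theta}^T\right] = \Sigma_{\theta|y}^{(j)} + \boldsymbol{\mu}_{\theta|y}^{(j)} \boldsymbol{\mu}_{\theta|y}^{(j)T}, \tag{12.47}$$

which results in

$$A := E\left[\boldsymbol{\theta}^T \boldsymbol{\theta}\right] = E\left[\mathrm{trace}\{\boldsymbol{\theta}\boldsymbol{\theta}^T\}\right] = \mathrm{trace}\left\{\boldsymbol{\mu}_{\theta|y}^{(j)} \boldsymbol{\mu}_{\theta|y}^{(j)T} + \Sigma_{\theta|y}^{(j)}\right\} = \big\|\boldsymbol{\mu}_{\theta|y}^{(j)}\big\|^2 + \mathrm{trace}\left\{\Sigma_{\theta|y}^{(j)}\right\}. \tag{12.48}$$

2. To compute $E\left[\|\boldsymbol{y} - \Phi\boldsymbol{\theta}\|^2\right]$, define
$\boldsymbol{\psi} := \boldsymbol{y} - \Phi\boldsymbol{\theta}$, and use the previous rationale to compute
$E[\boldsymbol{\psi}^T \boldsymbol{\psi}]$, which leads to (Problem 12.5)

$$B := E\left[\|\boldsymbol{y} - \Phi\boldsymbol{\theta}\|^2\right] = \big\|\boldsymbol{y} - \Phi\boldsymbol{\mu}_{\theta|y}^{(j)}\big\|^2 + \mathrm{trace}\left\{\Phi \Sigma_{\theta|y}^{(j)} \Phi^T\right\}. \tag{12.49}$$

Hence,

$$Q\big(\alpha, \beta; \alpha^{(j)}, \beta^{(j)}\big) = \frac{N}{2} \ln \beta + \frac{K}{2} \ln \alpha - \frac{\beta}{2} B - \frac{\alpha}{2} A - \left(\frac{N}{2} + \frac{K}{2}\right) \ln(2\pi). \tag{12.50}$$

• M-Step: Compute

$$\alpha^{(j+1)}: \quad \frac{\partial}{\partial \alpha} Q\big(\alpha, \beta; \alpha^{(j)}, \beta^{(j)}\big) = 0,$$

$$\beta^{(j+1)}: \quad \frac{\partial}{\partial \beta} Q\big(\alpha, \beta; \alpha^{(j)}, \beta^{(j)}\big) = 0,$$

which trivially lead to

$$\alpha^{(j+1)} = \frac{K}{\big\|\boldsymbol{\mu}_{\theta|y}^{(j)}\big\|^2 + \mathrm{trace}\left\{\Sigma_{\theta|y}^{(j)}\right\}}, \tag{12.51}$$

$$\beta^{(j+1)} = \frac{N}{\big\|\boldsymbol{y} - \Phi\boldsymbol{\mu}_{\theta|y}^{(j)}\big\|^2 + \mathrm{trace}\left\{\Phi \Sigma_{\theta|y}^{(j)} \Phi^T\right\}}. \tag{12.52}$$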

Once the algorithm converges, the resulting values for $\alpha$ and $\beta$ are used to completely
specify the involved PDFs, which can be used either to obtain an estimate of $\boldsymbol{\theta}$, for
example, $\hat{\boldsymbol{\theta}} = E[\boldsymbol{\theta}|\boldsymbol{y}]$, or to make
predictions via (12.21).
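
The recursions (12.43), (12.44), (12.51), and (12.52) can be sketched in a few lines of NumPy. This is a minimal illustration and not the author's code: the function name `em_linear_regression`, the stopping rule, and the synthetic cubic-polynomial data are assumptions made here for the example only.

```python
import numpy as np

def em_linear_regression(Phi, y, alpha=1.0, beta=1.0, max_iter=500, tol=1e-8):
    """EM estimates of alpha = 1/sigma_theta^2 and beta = 1/sigma_eta^2.

    Phi is the N x K design matrix and y holds the N observed outputs.
    The stopping rule (change in alpha and beta below tol) is an assumption.
    """
    N, K = Phi.shape
    for _ in range(max_iter):
        # E-step: posterior covariance (12.43) and mean (12.44)
        Sigma = np.linalg.inv(alpha * np.eye(K) + beta * Phi.T @ Phi)
        mu = beta * Sigma @ Phi.T @ y
        # Expectations A (12.48) and B (12.49)
        A = mu @ mu + np.trace(Sigma)
        resid = y - Phi @ mu
        B = resid @ resid + np.trace(Phi @ Sigma @ Phi.T)
        # M-step: updates (12.51) and (12.52)
        alpha_new, beta_new = K / A, N / B
        converged = abs(alpha_new - alpha) + abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    # Posterior that corresponds to the final precision values
    Sigma = np.linalg.inv(alpha * np.eye(K) + beta * Phi.T @ Phi)
    mu = beta * Sigma @ Phi.T @ y
    return alpha, beta, mu, Sigma

# Hypothetical synthetic data: cubic polynomial plus Gaussian noise
rng = np.random.default_rng(0)
N = 500
x = rng.uniform(0.0, 2.0, N)
Phi = np.vander(x, 4, increasing=True)      # columns 1, x, x^2, x^3
theta_true = np.array([0.2, -1.0, 0.9, 0.7])
noise_var = 0.05
y = Phi @ theta_true + rng.normal(0.0, np.sqrt(noise_var), N)

alpha, beta, mu, Sigma = em_linear_regression(Phi, y)
print("estimated noise variance:", 1.0 / beta)
```

The explicit inverse mirrors (12.43) for readability; for large $K$ a linear solve would usually be preferred.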

Example 12.2. In this example, the generalized linear regression model of Example 12.1 is reconsidered.
The goal is to use the EM algorithm of Section 12.5, as summarized by the recursions (12.43),
(12.44), (12.51), and (12.52). The variance of the Gaussian noise used in the model to generate the data
was set equal to $\sigma_\eta^2 = 0.05$. The number of training points was $N = 500$. For the EM
algorithm, both $\alpha$ and $\beta$ were initialized to one. The correct dimensionality for the unknown
parameter vector was used. The recovered values after the convergence of the EM were $\alpha = 1.32$,
corresponding to $\sigma_\theta^2 = 0.756$, and $\beta = 19.96$, corresponding to
$\sigma_\eta^2 = 0.0501$. Note that the latter is very close to the true variance of the noise. Then,
predictions of the output variable $y$ were performed at 20 points, using (12.22) and the value of
$\boldsymbol{\mu}_{\theta|y}$ recovered by the EM algorithm, via (12.44).

Fig. 12.5A shows the predictions together with the associated error bars, computed from (12.23)
using the values of $\sigma_\eta^2$ and $\sigma_\theta^2$ obtained via the EM algorithm. Fig. 12.5B shows
the convergence curve for $\sigma_\eta^2$ as a function of the number of iterations of the EM algorithm.

FIGURE 12.5

(A) The original graph from which the training points were sampled. In red, the respective predictions
$\hat{y}$ and associated error bars for 20 randomly chosen points are shown. (B) The convergence curve for
$\sigma_\eta^2$ as a function of the iterations of the EM algorithm. The red line corresponds to the true
value.


12.6 GAUSSIAN MIXTURE MODELS 

So far, we have seen a number of PDFs that can be used to model the distribution of an unknown
random vector $\boldsymbol{x} \in \mathbb{R}^l$. However, all these models restrict the PDF to a specific
functional form. Mixture modeling provides the freedom to model the unknown PDF, $p(\boldsymbol{x})$, as a
linear combination of different distributions, that is,

$$p(\boldsymbol{x}) = \sum_{k=1}^{K} P_k\, p(\boldsymbol{x}|k), \tag{12.53}$$

where $P_k$ is the parameter weighting the specific contributing PDF, $p(\boldsymbol{x}|k)$. To guarantee
that $p(\boldsymbol{x})$ is a PDF, the weighting parameters must be nonnegative and add to one
($\sum_{k=1}^{K} P_k = 1$). The physical interpretation of (12.53) is that we are given a set of $K$
distributions, $p(\boldsymbol{x}|k)$, $k = 1, 2, \ldots, K$. Each observation $\boldsymbol{x}_n$,
$n = 1, 2, \ldots, N$, is drawn from one of these $K$ distributions, but we are not told from
which one. All we know is a set of parameters, $P_k$, $k = 1, 2, \ldots, K$, each one providing the
probability that a sample has been drawn from the corresponding PDF, $p(\boldsymbol{x}|k)$. It can be shown
that for a large enough number of mixtures, $K$, and appropriate choice of the involved parameters, one
can approximate arbitrarily closely any continuous PDF.
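
The generative reading of (12.53), where a hidden label is drawn first and the observation second, can be sketched as follows. The choice of spherical Gaussian components, the function name `sample_mixture`, and all numerical values are assumptions made here for illustration; the mixture model itself places no such restriction on $p(\boldsymbol{x}|k)$.

```python
import numpy as np

def sample_mixture(P, means, variances, n_samples, rng):
    """Draw samples from sum_k P_k N(x | mu_k, sigma_k^2 I).

    Returns the samples together with the labels k_n, which in a real
    mixture task would be hidden from the learner.
    """
    P = np.asarray(P, dtype=float)
    means = np.asarray(means, dtype=float)            # shape (K, l)
    std = np.sqrt(np.asarray(variances, dtype=float))  # shape (K,)
    K, l = means.shape
    # Step 1: choose the generating distribution with probabilities P_k
    labels = rng.choice(K, size=n_samples, p=P)
    # Step 2: draw each observation from its chosen component
    noise = rng.standard_normal((n_samples, l))
    X = means[labels] + std[labels, None] * noise
    return X, labels

rng = np.random.default_rng(0)
X, labels = sample_mixture(
    P=[0.5, 0.3, 0.2],
    means=[[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]],
    variances=[1.0, 0.5, 2.0],
    n_samples=1000,
    rng=rng,
)
print(X.shape, np.bincount(labels) / len(labels))
```

Discarding `labels` and keeping only `X` reproduces the situation described above: the observations are available, but not the identity of the component that generated each one.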

Mixture modeling is a typical task involving latent variables; that is, the labels $k$ of the PDF from
which an obtained observation has originated. In practice, each $p(\boldsymbol{x}|k)$ is chosen from a
known PDF family, parameterized via a set of parameters, and (12.53) can be rewritten as

$$p(\boldsymbol{x}) = \sum_{k=1}^{K} P_k\, p(\boldsymbol{x}|k; \boldsymbol{\xi}_k), \tag{12.54}$$

and the task is to estimate $(P_k, \boldsymbol{\xi}_k)$, $k = 1, 2, \ldots, K$, based on a set of
observations $\boldsymbol{x}_n$, $n = 1, 2, \ldots, N$. The set of observations
$\mathcal{X} = \{\boldsymbol{x}_n, n = 1, \ldots, N\}$ forms the incomplete set while the complete set
$\{\mathcal{X}, \mathcal{K}\}$ comprises the sample pairs $(\boldsymbol{x}_n, k_n)$, $n = 1, \ldots, N$,
with $k_n$ being the label of the distribution (PDF) from which $\boldsymbol{x}_n$ was drawn. Parameter
estimation for such a problem naturally lends itself to be treated via the EM algorithm. We will
demonstrate the procedure via the use of Gaussian mixtures.

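Let

$$p(\boldsymbol{x}|k; \boldsymbol{\xi}_k) = p(\boldsymbol{x}|k; \boldsymbol{\mu}_k, \Sigma_k) = \mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k, \Sigma_k),$$

where for simplicity we will assume that $\Sigma_k = \sigma_k^2 I$, $k = 1, \ldots, K$. We will further
assume the observations to be i.i.d. For such a modeling, the following hold true: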

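• The log-likelihood of the complete data set is given by

$$\ln p(\mathcal{X}, \mathcal{K}; \Xi, \boldsymbol{P}) = \sum_{n=1}^{N} \ln p(\boldsymbol{x}_n, k_n; \boldsymbol{\xi}_{k_n}) = \sum_{n=1}^{N} \ln \big( p(\boldsymbol{x}_n|k_n; \boldsymbol{\xi}_{k_n}) P_{k_n} \big). \tag{12.55}$$

We have used the notation

$$\Xi = [\boldsymbol{\xi}_1^T, \ldots, \boldsymbol{\xi}_K^T]^T, \quad \boldsymbol{P} = [P_1, P_2, \ldots, P_K]^T, \quad \text{and} \quad \boldsymbol{\xi}_k = [\boldsymbol{\mu}_k^T, \sigma_k^2]^T.$$

In other words, the deterministic parameters that have to be estimated via the EM algorithm are the
mean values and variances of all the Gaussian mixtures, as well as the respective mixing
probabilities.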

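• The posterior probabilities of the latent discrete variables are given by

$$P(k|\boldsymbol{x}; \Xi, \boldsymbol{P}) = \frac{p(\boldsymbol{x}|k; \boldsymbol{\xi}_k) P_k}{p(\boldsymbol{x}; \Xi, \boldsymbol{P})}, \tag{12.56}$$

where

$$p(\boldsymbol{x}; \Xi, \boldsymbol{P}) = \sum_{k=1}^{K} P_k\, p(\boldsymbol{x}|k; \boldsymbol{\xi}_k). \tag{12.57}$$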


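We now have all the ingredients required by the EM algorithm. Starting from $\Xi^{(0)}$ and
$\boldsymbol{P}^{(0)}$, the $(j+1)$th iteration comprises the following steps:

• E-step: Using (12.56) and (12.57), compute

$$P\big(k|\boldsymbol{x}_n; \Xi^{(j)}, \boldsymbol{P}^{(j)}\big) = \frac{p\big(\boldsymbol{x}_n|k; \boldsymbol{\xi}_k^{(j)}\big) P_k^{(j)}}{p\big(\boldsymbol{x}_n; \Xi^{(j)}, \boldsymbol{P}^{(j)}\big)}, \quad n = 1, \ldots, N, \; k = 1, \ldots, K, \tag{12.58}$$

which in turn defines

$$Q\big(\Xi, \boldsymbol{P}; \Xi^{(j)}, \boldsymbol{P}^{(j)}\big) = \sum_{n=1}^{N} E\left[\ln \big( p(\boldsymbol{x}_n|k_n; \boldsymbol{\xi}_{k_n}) P_{k_n} \big)\right]$$

$$= \sum_{n=1}^{N} \sum_{k=1}^{K} P\big(k|\boldsymbol{x}_n; \Xi^{(j)}, \boldsymbol{P}^{(j)}\big) \left( \ln P_k - \frac{l}{2} \ln \sigma_k^2 - \frac{1}{2\sigma_k^2} \|\boldsymbol{x}_n - \boldsymbol{\mu}_k\|^2 \right) + C, \tag{12.59}$$

where $C$ includes all the terms corresponding to the normalization constant. Note that we have
finally relaxed the notation from $k_n$ to $k$, because we sum up over all $k$, which does not depend
on $n$.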
• M-step: Maximization of $Q\big(\Xi, \boldsymbol{P}; \Xi^{(j)}, \boldsymbol{P}^{(j)}\big)$ with respect to
all the involved parameters results in the following set of recursions (Problem 12.6):

Set, for notational convenience,

$$\gamma_{kn} := P\big(k|\boldsymbol{x}_n; \Xi^{(j)}, \boldsymbol{P}^{(j)}\big).$$

Then

$$\boldsymbol{\mu}_k^{(j+1)} = \frac{\sum_{n=1}^{N} \gamma_{kn} \boldsymbol{x}_n}{\sum_{n=1}^{N} \gamma_{kn}}, \tag{12.60}$$

$$\sigma_k^{2\,(j+1)} = \frac{\sum_{n=1}^{N} \gamma_{kn} \big\|\boldsymbol{x}_n - \boldsymbol{\mu}_k^{(j+1)}\big\|^2}{l \sum_{n=1}^{N} \gamma_{kn}}, \tag{12.61}$$

$$P_k^{(j+1)} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{kn}. \tag{12.62}$$

Iterations continue until a convergence criterion is met. The extension to the case of a general
covariance matrix is straightforward by replacing (12.61) by

$$\Sigma_k^{(j+1)} = \frac{\sum_{n=1}^{N} \gamma_{kn} \big(\boldsymbol{x}_n - \boldsymbol{\mu}_k^{(j+1)}\big)\big(\boldsymbol{x}_n - \boldsymbol{\mu}_k^{(j+1)}\big)^T}{\sum_{n=1}^{N} \gamma_{kn}}.$$

Remarks 12.4.

• To get good initialization for the EM algorithm, sometimes a simpler clustering algorithm, for
example, the k-means (Section 12.6.1 and [41]), is run to provide an initial estimate of the means
and shapes of clusters (covariance matrices), by associating each mixture with a cluster in the input
space. Another simpler way is to select $K$ points randomly from the data set. A more elaborate
technique, which is commonly used, is to select them randomly but in such a way to make sure that
the whole data set is represented in a balanced way (see, for example, [1]).

• The number of mixtures, $K$, is usually determined by cross-validation (Chapter 3; see also [16]).

• The mixing parameters $P_k$, $k = 1, \ldots, K$, should be initialized by keeping in mind that they are
probabilities and they have to add to one.

• One of the problems that may be encountered in practice in the Gaussian mixture task is when
one of the mixture components is centered at (or very close to) one of the data points, for example,
$\boldsymbol{\mu}_k = \boldsymbol{x}_n$, for some values of $k$ and $n$. In such a case, the exponential
term of the respective Gaussian becomes one and the contribution of this particular component to the
likelihood is equal to $(2\pi\sigma_k^2)^{-l/2}$. If, in addition, $\sigma_k$ is very small, this will lead
the likelihood to a large value, although this is not indicative that the true model has been learned.
Soon, we will see that the use of priors can alleviate such problems.

• Identifiability: A further issue associated with the EM algorithm, in the context of distribution
mixtures, is that the obtained solution in the parameter space is not unique. For the case of $K$
mixtures, for each solution (point in the parameter space) there are $K! - 1$ other points which give
rise to the same distribution. For example, let us fit a model of two Gaussians in the one-dimensional
space, which will result in estimates for the respective mean values, $\hat{\mu}_1$ and $\hat{\mu}_2$.
However, in the corresponding parameter space, there is an uncertainty on whether these values define the
point $\boldsymbol{\mu}_a = [\hat{\mu}_1, \hat{\mu}_2]^T$ or the point
$\boldsymbol{\mu}_b = [\hat{\mu}_2, \hat{\mu}_1]^T$. Both of these points give rise to the same
distribution. We say that the parameters in our model are not identifiable. A parameter (vector) which
defines a family of distributions $p(\boldsymbol{x}; \boldsymbol{\theta})$ is said to be identifiable if
$p(\boldsymbol{x}; \boldsymbol{\theta}_1) \neq p(\boldsymbol{x}; \boldsymbol{\theta}_2)$ for
$\boldsymbol{\theta}_1 \neq \boldsymbol{\theta}_2$ (see, e.g., [7]). Although in our context, where our
interest is in computing $p(\boldsymbol{x})$, unidentifiability does not cause any problems, this can be
an issue in cases where the focus of interest lies on the parameters (see, for example, [36]).

• Mixtures of Student's t distributions: A significant shortcoming of mixtures based on normal
distributions is their vulnerability to outliers. The replacement of normal distributions with the
heavier-tailed Student's t distributions (see Section 13.5) has been proposed as a way to mitigate these
shortcomings and a related treatment of the resulting model under an EM algorithmic framework
has been conducted. Although the steps get a bit more involved, the ideas explored so far transfer
nicely in this case too (see, for example, [9,10,35,37]).
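
The E-step (12.58) and the M-step recursions (12.60)-(12.62) for spherical covariances can be sketched as below. This is an illustrative implementation under stated assumptions, not the author's code: the function name `gmm_em_spherical`, the fixed iteration count in place of a convergence test, and the log-domain evaluation of the responsibilities are choices made here. The synthetic data and initial values loosely follow the setup of Example 12.3.

```python
import numpy as np

def gmm_em_spherical(X, mu, var, P, n_iter=100):
    """EM for a Gaussian mixture with Sigma_k = sigma_k^2 I.

    X: (N, l) data; mu: (K, l) initial means; var: (K,) initial variances;
    P: (K,) initial mixing probabilities. Returns the final parameters, the
    responsibilities of the last E-step, and the log-likelihood per iteration.
    """
    X = np.asarray(X, dtype=float)
    mu = np.array(mu, dtype=float)
    var = np.array(var, dtype=float)
    P = np.array(P, dtype=float)
    N, l = X.shape
    log_lik = []
    for _ in range(n_iter):
        # E-step (12.58): gamma[n, k] = P(k | x_n), evaluated in the log domain
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # (N, K)
        log_w = np.log(P) - 0.5 * l * np.log(2.0 * np.pi * var) - 0.5 * sq / var
        m = log_w.max(axis=1, keepdims=True)
        log_px = m + np.log(np.exp(log_w - m).sum(axis=1, keepdims=True))
        gamma = np.exp(log_w - log_px)
        log_lik.append(float(log_px.sum()))   # likelihood of current parameters
        # M-step
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]                                # (12.60)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (gamma * sq).sum(axis=0) / (l * Nk)                       # (12.61)
        P = Nk / N                                                      # (12.62)
    return mu, var, P, gamma, np.array(log_lik)

# Synthetic data: three spherical Gaussians, 100 points each
rng = np.random.default_rng(1)
true_mu = np.array([[10.0, 3.0], [1.0, 1.0], [5.0, 4.0]])
true_var = np.array([1.0, 1.5, 2.0])
X = np.vstack([m + np.sqrt(v) * rng.standard_normal((100, 2))
               for m, v in zip(true_mu, true_var)])

mu, var, P, gamma, log_lik = gmm_em_spherical(
    X,
    mu=[[3.0, 5.0], [2.0, 0.4], [4.0, 3.0]],
    var=[1.0, 1.0, 1.0],
    P=[1 / 3, 1 / 3, 1 / 3],
    n_iter=30,
)
print(np.round(mu, 2), np.round(var, 2), np.round(P, 2))
```

The recorded `log_lik` sequence should be nondecreasing, which is a convenient sanity check on any EM implementation. The sketch does not guard against the degenerate case in which a component collapses onto a single point.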

Example 12.3. The goal of this example is to demonstrate the application of the EM algorithm in the
context of the Gaussian mixture modeling. The data are generated according to three Gaussians in the
two-dimensional space, with parameters

$$\boldsymbol{\mu}_1 = [10, 3]^T, \quad \boldsymbol{\mu}_2 = [1, 1]^T, \quad \boldsymbol{\mu}_3 = [5, 4]^T$$

and covariance matrices

$$\Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix},$$

respectively. The number of the generated points is 300, with 100 points per mixture. The points are
shown in Fig. 12.6, together with the gray circles indicating the 80% probability regions, for each one
of the clusters. The EM algorithm comprising the steps (12.58) and (12.60)-(12.62) was run with the
following initial values:

$$\boldsymbol{\mu}_1^{(0)} = [3, 5]^T, \quad \boldsymbol{\mu}_2^{(0)} = [2, 0.4]^T, \quad \boldsymbol{\mu}_3^{(0)} = [4, 3]^T,$$

and

$$\Sigma_1^{(0)} = \Sigma_2^{(0)} = \Sigma_3^{(0)} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

The probabilities were initialized to their true values $P_1^{(0)} = P_2^{(0)} = P_3^{(0)} = 1/3$. The red
curves in Fig. 12.6 correspond to the mixtures recovered by the EM algorithm at (A) the initial
estimates, (B) after five iterations, and (C) after convergence. Fig. 12.6D shows the log-likelihood as
a function of the number of iterations.

FIGURE 12.6

The curves (ellipses) indicate the 80% probability regions. The gray curves correspond to the true
Gaussian clusters of Example 12.3. The red curves correspond to (A) the initial values for the mean and
the covariance matrices, (B) the mixtures recovered by the EM algorithm after five iterations, and
(C) after 30 iterations. (D) The log-likelihood as a function of the number of iterations.

Fig. 12.7 corresponds to a different setup. This time, the mean values were initialized at points very
far from the true ones, that is,

$$\boldsymbol{\mu}_1^{(0)} = [10, 13]^T, \quad \boldsymbol{\mu}_2^{(0)} = [11, 12]^T, \quad \boldsymbol{\mu}_3^{(0)} = [13, 11]^T,$$

while the covariances and probabilities were initialized as before. Observe that in this case, the EM
algorithm fails to capture the true nature of the problem, having been trapped in a local maximum of
the likelihood.

FIGURE 12.7

This is the counterpart of Fig. 12.6, where now the initial values for the means are very far from the
true ones. In this case, the EM fails to recover the true nature of the mixtures and has been trapped
in a local maximum of the likelihood.

12.6.1 GAUSSIAN MIXTURE MODELING AND CLUSTERING

Clustering or unsupervised learning is an important part of machine learning, which is not treated in this
book. Extensive coverage of clustering is given in, for example, [41]. However, the mixture modeling
task via the EM offers us a good excuse to say a few words. Without going into formal definitions, the
task of clustering is to assign a number of points, $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$, into $K$
groups or clusters. Points that are assigned to the same cluster must be more "similar" than points
which are assigned to different clusters. Some clustering algorithms need the number of clusters, $K$,
to be provided by the user as an input variable. Other schemes treat it as a free parameter to be
recovered from the data by the algorithm. The other major issue in clustering is to quantify
"similarity." Different definitions end up with different clusterings. A clustering is a specific
allocation of the points to clusters. In general, assigning points to clusters according to an
optimality criterion is an NP-hard task (see, for example, [41]). Thus, in general, any clustering
algorithm provides a suboptimal solution.

Gaussian mixture modeling is among the popular clustering algorithms. The main assumption
is that the points which belong to the same cluster are distributed according to the same Gaussian
distribution (this is how similarity is defined in this case), of unknown mean and covariance
matrix. Each mixture component defines a different cluster. Thus, the goal is to run the EM algorithm
over the available data points to provide, after convergence, the posterior probabilities
$P(k|\boldsymbol{x}_n)$, $k = 1, 2, \ldots, K$, $n = 1, 2, \ldots, N$, where each $k$ corresponds to a
cluster. Then, each point is assigned to cluster $k$ according to the rule

$$\text{assign } \boldsymbol{x}_n \text{ to cluster } k = \arg\max_i P(i|\boldsymbol{x}_n), \quad i = 1, 2, \ldots, K.$$


The EM algorithm for clustering can be considered to be a refined version of a more primitive scheme,
known as the k-means or isodata algorithm. In the EM algorithm, the posterior probability of each point
$\boldsymbol{x}_n$, with respect to each one of the clusters $k$, is computed recursively. Moreover, the
mean value $\boldsymbol{\mu}_k$ of the points associated with cluster $k$ is computed as a weighted
average of all the training points (12.60). In contrast, in the k-means algorithm, at each iteration the
posterior probability gets a binary value in {1, 0}; for each point $\boldsymbol{x}_n$, the Euclidean
distance from all the currently available estimates of the mean values is computed, and the posterior
probability is estimated according to the following rule:

$$P(k|\boldsymbol{x}_n) = \begin{cases} 1, & \text{if } \|\boldsymbol{x}_n - \boldsymbol{\mu}_k\|^2 < \|\boldsymbol{x}_n - \boldsymbol{\mu}_j\|^2, \; j \neq k, \\ 0, & \text{otherwise.} \end{cases}$$

The k-means algorithm is not concerned about covariance matrices. Despite its simplicity, it is not 
an exaggeration to say that it is the most well-known clustering algorithm, and a number of theoretical 
papers and improved versions have been proposed over the years (see, for example, [41]). Due to its 
popularity, we will take the liberty to state it in Algorithm 12.1. 

Algorithm 12.1 (The k-means or isodata clustering algorithm).

• Initialize
  - Select the number of clusters $K$.
  - Set $\boldsymbol{\mu}_k$, $k = 1, 2, \ldots, K$, to arbitrary values.
• Repeat
  • For $n = 1, 2, \ldots, N$, Do
    - Determine the closest cluster mean, say, $\boldsymbol{\mu}_k$, to $\boldsymbol{x}_n$.
    - Set $b(n) = k$.
  • End For
  • For $k = 1, 2, \ldots, K$, Do
    - Update $\boldsymbol{\mu}_k$ as the mean of all the points $\boldsymbol{x}_n$ with $b(n) = k$, $n = 1, 2, \ldots, N$.
  • End For
• Until no change in $\boldsymbol{\mu}_k$, $k = 1, 2, \ldots, K$, occurs between two successive iterations.
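
Algorithm 12.1 translates almost line by line into NumPy. The sketch below is illustrative only: the function name `k_means`, the iteration cap, and the handling of empty clusters (the old mean is kept) are assumptions that the algorithm statement does not specify.

```python
import numpy as np

def k_means(X, mu_init, max_iter=100):
    """Basic k-means following Algorithm 12.1.

    X: (N, l) data; mu_init: (K, l) initial cluster means.
    Returns the final means and the assignments b(n).
    """
    X = np.asarray(X, dtype=float)
    mu = np.array(mu_init, dtype=float)
    K = mu.shape[0]
    for _ in range(max_iter):
        # Assignment loop: b(n) is the index of the closest cluster mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        b = d2.argmin(axis=1)
        # Update loop: each mean becomes the average of its assigned points
        new_mu = mu.copy()
        for k in range(K):
            members = X[b == k]
            if len(members) > 0:      # assumption: an empty cluster keeps its mean
                new_mu[k] = members.mean(axis=0)
        # Until: stop when the means no longer change
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, b

# Two well-separated synthetic blobs, 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
mu, b = k_means(X, mu_init=X[[0, 50]])
print(np.round(mu, 2))
```

Initializing the means at data points, as done here, is one common choice; the algorithm statement allows arbitrary initial values.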

The k-means algorithm can also be derived as a limiting case of the EM scheme (for example,
[34]). Note that both the EM algorithm and the k-means one can only recover compact clusters. In
other words, if the points are distributed in ring-shaped clusters, then this type of clustering
algorithm is not appropriate.

Fig. 12.8A shows the data points generated by two Gaussians; 200 points from each one. The points
are shown by red and gray colors, depending on the Gaussian that generated them. Of course in
clustering, the data points are given to the algorithm without the "color" (labeling). It is up to the
algorithm to make the partition in clusters. For both the EM and the k-means algorithm, the correct
number of clusters ($K = 2$) was given. The k-means was initialized with zero mean values. Fig. 12.8B
shows the clusters formed by the k-means and Fig. 12.8D shows the clusters formed by the EM algorithm.
Fig. 12.8C shows the Gaussians that were used for the initialization of the EM.

FIGURE 12.8

(A) The data points generated by two Gaussians (red and gray). (B) The recovered clusters by the k-means
(red and gray). (C) The 80% probability curves for the initialization of the EM algorithm. (D) The final
Gaussians obtained by the EM algorithm with the respective clusters.


Fig. 12.9 shows the respective sequence of figures, which corresponds to points obtained by the
same Gaussians; however, now, there is an imbalance in the number of the points, as only 20 points
spring forth from the first one and 200 points from the second. Observe that the k-means has a problem
in recovering the true clustering structure; it attempts to make the two clusters more equally sized.
A number of techniques and versions of the basic k-means scheme have been proposed to overcome
its drawbacks (see [41]). Finally, it must be stressed that both the EM and the k-means algorithm will
always recover as many clusters as the user-defined input variable $K$ dictates. In the case of the EM
algorithm, this drawback is overcome when the variational EM algorithm is used, as will be discussed
in Section 13.4.


FIGURE 12.9

(A) The data points generated by two Gaussians (red and gray). One of the clusters consists of only 20
points and the other one of 200 points. (B) The recovered clusters by the k-means (red and gray).
Observe that the algorithm has not identified the correct clusters, by assigning more points to the
"smaller" one. (C) The 80% probability curves for the initialization of the EM algorithm. (D) The final
Gaussians obtained by the EM algorithm, with the respective clusters.


12.7 THE EM ALGORITHM: A LOWER BOUND MAXIMIZATION VIEW

In Section 12.4, the EM algorithm was introduced and it was stated that maximizing the expectation
of the complete log-likelihood with respect to the set of parameters $\boldsymbol{\xi}$ is equivalent to
maximizing the corresponding evidence function, i.e., $p(\mathcal{X}; \boldsymbol{\xi})$. However, this was
not explicitly shown. In this section, this connection will become clear. It will be shown that the EM
algorithm basically maximizes a tight lower bound of the evidence. Furthermore, this interpretation of
the EM will allow for generalizations to extend it to cases where the computations involving the
posterior, $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$, are not computationally tractable.

Let us consider the functional (see footnote 6)

$$\mathcal{F}(q, \boldsymbol{\xi}) := \int q(\mathcal{X}^l) \ln \frac{p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})}{q(\mathcal{X}^l)}\, d\mathcal{X}^l, \tag{12.63}$$

where $q(\mathcal{X}^l)$ is any nonnegative function that integrates to one; that is, it is a PDF defined
over the latent variables. The functional $\mathcal{F}$ depends on $\boldsymbol{\xi}$ and on $q$, and its
definition bears a strong similarity to the notion of free energy, used in statistical physics. Indeed,
(12.63) can be written as

$$\mathcal{F}(q, \boldsymbol{\xi}) = \int q(\mathcal{X}^l) \ln p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})\, d\mathcal{X}^l + H, \tag{12.64}$$

where

$$H = -\int q(\mathcal{X}^l) \ln q(\mathcal{X}^l)\, d\mathcal{X}^l$$

is the entropy associated with $q(\mathcal{X}^l)$. If one defines
$-\ln p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})$ as the energy of the system
$(\mathcal{X}, \mathcal{X}^l)$, then $\mathcal{F}(q, \boldsymbol{\xi})$ represents the negative of the
so-called free energy [34]. Focusing on the first term on the right-hand side of Eq. (12.64), the
functional can be written as

$$\mathcal{F}(q, \boldsymbol{\xi}) = E_q\left[\ln p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})\right] + H. \tag{12.65}$$

In other words, the first term on the right-hand side is very similar to the $Q$ term in Eq. (12.40). The
only difference is that the expectation here is taken with respect to $q$.

Footnote 6: A functional is an operator that takes as input a function and returns a real value. It is
a generalization of our familiar functions, where now the inputs are also functions.

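Taking into account Bayes' theorem, Eq. (12.63) becomes

$$\mathcal{F}(q, \boldsymbol{\xi}) = \int q(\mathcal{X}^l) \ln \frac{p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})\, p(\mathcal{X}; \boldsymbol{\xi})}{q(\mathcal{X}^l)}\, d\mathcal{X}^l$$

$$= \int q(\mathcal{X}^l) \ln \frac{p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})}{q(\mathcal{X}^l)}\, d\mathcal{X}^l + \ln p(\mathcal{X}; \boldsymbol{\xi}), \tag{12.66}$$

where the second equality results because $\ln p(\mathcal{X}; \boldsymbol{\xi})$ does not depend on
$\mathcal{X}^l$ and $q(\mathcal{X}^l)$ integrates to one, being a probability distribution. The first term
on the right-hand side is the negative of the so-called KL divergence (Eq. (2.161)) between
$q(\mathcal{X}^l)$ and $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$, which we will denote as
$\mathrm{KL}(q \| p)$. Recall that the KL divergence measures how different two distributions are. If the
two involved distributions are equal, then their KL divergence becomes zero. Thus, finally, we get

$$\ln p(\mathcal{X}; \boldsymbol{\xi}) = \mathcal{F}(q, \boldsymbol{\xi}) + \mathrm{KL}(q \| p). \tag{12.67}$$

Because the KL divergence is a nonnegative quantity, i.e., $\mathrm{KL}(q \| p) \geq 0$ (Problem 12.7), it
turns out that

$$\ln p(\mathcal{X}; \boldsymbol{\xi}) \geq \mathcal{F}(q, \boldsymbol{\xi}). \tag{12.68}$$

In other words, the functional $\mathcal{F}(q, \boldsymbol{\xi})$ is a lower bound of the log-likelihood
function. Also, the bound becomes tight if $\mathrm{KL}(q \| p) = 0$, which is true if and only if
$q(\mathcal{X}^l) = p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$. Moreover, the previous bound is valid
for all distributions $q$ and all parameters $\boldsymbol{\xi}$.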
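
The decomposition (12.67) and the bound (12.68) are easy to verify numerically when the latent variable is discrete, so that the integral in (12.63) becomes a sum. The toy model below (a single scalar observation from a two-component Gaussian mixture with unit variances), the function name `bound_terms`, and all numbers are assumptions made here for illustration; `q` must be strictly positive for the logarithms to be finite.

```python
import numpy as np

def bound_terms(joint, q):
    """Return ln p(X; xi), F(q, xi), and KL(q || p) for a discrete latent variable.

    joint[k] = p(X, k; xi) for the observed X; q[k] is any strictly positive
    distribution over the latent label k.
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    evidence = joint.sum()                    # p(X; xi), marginal over k
    posterior = joint / evidence              # p(k | X; xi)
    F = np.sum(q * np.log(joint / q))         # (12.63) with a sum over k
    KL = np.sum(q * np.log(q / posterior))    # KL(q || p)
    return np.log(evidence), F, KL

# Toy model: one observation x, two Gaussian components with unit variance
x = 0.7
P = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
joint = P * np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

# An arbitrary q gives a strict lower bound
log_ev, F, KL = bound_terms(joint, q=[0.5, 0.5])
# Choosing q equal to the posterior makes the bound tight
_, F_tight, KL_tight = bound_terms(joint, q=joint / joint.sum())
print(log_ev, F, KL, F_tight, KL_tight)
```

With the arbitrary `q`, `F` falls below `log_ev` by exactly `KL`; with `q` set to the posterior, `KL_tight` vanishes and `F_tight` coincides with `log_ev`.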

The previous findings pave the way for maximizing $\ln p(\mathcal{X}; \boldsymbol{\xi})$ by trying to
maximize its lower bound. This is in line with a more general class of optimization algorithms, known as
minorize-maximization (or majorize-minimization) (MM) methods [20]. A surrogate function that minorizes
the cost function (lower bound), which is easier to maximize, is employed. Then maximizing this
lower bound iteratively pushes the cost function to a local maximum. Note that in our case
maximization of $\mathcal{F}$ involves two terms, namely, $q$ and $\boldsymbol{\xi}$. We will adopt a
widely used technique known as alternating optimization. Such an approach imposes an iterative procedure.
Starting from some initial conditions, one "freezes" the value of one of the involved terms and maximizes
with respect to the other. Then the value of the latter term is frozen and optimization is carried out
with respect to the former. These alternating steps carry on until convergence. In our current context,
starting from an arbitrary $\boldsymbol{\xi}^{(0)}$, the $(j+1)$th iteration comprises the following steps:

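• Step 1: Keeping $\boldsymbol{\xi}^{(j)}$ fixed, optimize with respect to $q$. This step tightens the
lower bound in (12.68). This is achieved if $\mathrm{KL}(q \| p) = 0$, and it can only happen if

$$q^{(j+1)}(\mathcal{X}^l) = p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big), \tag{12.69}$$

that is, if we set $q(\mathcal{X}^l)$ equal to the posterior given $\mathcal{X}$ and
$\boldsymbol{\xi}^{(j)}$; as (12.67) suggests, this makes the bound tight, that is,

$$\ln p\big(\mathcal{X}; \boldsymbol{\xi}^{(j)}\big) = \mathcal{F}\Big(p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big), \boldsymbol{\xi}^{(j)}\Big). \tag{12.70}$$

• Step 2: Fixing $q^{(j+1)}$, insert it in the place of $q$ in (12.68), and because the bound holds for
any $q$, maximize with respect to $\boldsymbol{\xi}$, that is,

$$\boldsymbol{\xi}^{(j+1)} = \arg\max_{\boldsymbol{\xi}} \mathcal{F}\Big(p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big), \boldsymbol{\xi}\Big).$$

Hence, we have now obtained the following inequalities:

$$\ln p\big(\mathcal{X}; \boldsymbol{\xi}^{(j+1)}\big) \geq \mathcal{F}\Big(p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big), \boldsymbol{\xi}^{(j+1)}\Big) \geq \mathcal{F}\Big(p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big), \boldsymbol{\xi}^{(j)}\Big),$$

and taking into account that, by (12.70), the last term on the right-hand side is equal to
$\ln p\big(\mathcal{X}; \boldsymbol{\xi}^{(j)}\big)$, we have

$$\ln p\big(\mathcal{X}; \boldsymbol{\xi}^{(j+1)}\big) \geq \ln p\big(\mathcal{X}; \boldsymbol{\xi}^{(j)}\big).$$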

It is now readily seen that we have rederived the EM algorithm. Indeed, from the definition of
$\mathcal{F}(\cdot, \cdot)$ in (12.63) we obtain

$$\mathcal{F}\Big(p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big), \boldsymbol{\xi}\Big) = Q\big(\boldsymbol{\xi}, \boldsymbol{\xi}^{(j)}\big) - \int p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big) \ln p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big)\, d\mathcal{X}^l, \tag{12.71}$$

where $Q\big(\boldsymbol{\xi}, \boldsymbol{\xi}^{(j)}\big)$ is the same as in (12.40) and the second term on
the right-hand side is independent of $\boldsymbol{\xi}$; this latter term is equal to the entropy
associated with $q^{(j+1)}(\mathcal{X}^l)$. The rederivation of the EM via this path makes it clear that
the quantity that is maximized is the log-likelihood, $\ln p(\mathcal{X}; \boldsymbol{\xi})$, and
that its value is guaranteed not to decrease after each combined iteration step. Fig. 12.10 illustrates
schematically the two EM steps comprising the $(j+1)$th iteration.

FIGURE 12.10

The E-step adjusts $q^{(j+1)} := q^{(j+1)}(\mathcal{X}^l)$ so that its KL divergence from
$p^{(j)} := p\big(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi}^{(j)}\big)$ becomes zero. The M-step
maximizes with respect to $\boldsymbol{\xi}$. We used $p(\boldsymbol{\xi})$ to denote the evidence function.


Needless to say, the EM algorithm is not a panacea. We will soon seek variants to deal with
cases where the posterior cannot be given in an analytic form. Moreover, there are still cases where the
M-step can be computationally intractable. To this end, several variants have been proposed (see, for
example, [31,34]).

Remarks 12.5.

• Online versions of EM: We have already pointed out that in many cases of large data applications,
online versions are the preferable choice in practice. The EM algorithm is no exception, and a
number of related versions have been proposed. In [34], an online EM algorithm is proposed based
on the lower bound interpretation. In [6], stochastic approximation arguments have been employed.
In [25], a comparative study of different techniques is reported.

• Often in practice, carrying out the expectation step may be intractable. Later on in the next chapter,
we will see variational methods as a way to overcome this obstacle. An alternative path is to employ
Monte Carlo sampling techniques (Chapter 14) to generate samples from the involved distributions
and approximate the expectation with the computation of the respective sample mean (see, for
example, [8,11]).


12.8 EXPONENTIAL FAMILY OF PROBABILITY DISTRIBUTIONS 

It must be clear by now that the Bayesian setting starts by adopting a specific functional form for the
conditional distribution, which "explains" the generation of the observations given the parameters, and
a prior distribution that describes the randomness of the associated parameters. The latter is equivalent
to regularizing the corresponding learning task and expresses our uncertainty about the values of the
parameters, prior to receiving any observations. The goal in Bayesian learning is to obtain the
posterior distribution of the parameters given the values of the observed variables. The computation of
the posterior can be greatly facilitated if the conditional and the prior distributions are carefully
chosen. Selecting these distributions from the exponential family makes the computation of the posterior
a rather trivial task. Distributions of the exponential family will be used in the next chapter, where
approximate Bayesian inference methods are discussed.

We will treat the topic of the exponential family of probability distributions in a general setting.
Let $\boldsymbol{x}$ be a random vector and $\boldsymbol{\theta}$ a random (parameter) vector. We say that
the parameterized PDF $p(\boldsymbol{x}|\boldsymbol{\theta})$ is of the exponential form if

$$p(\boldsymbol{x}|\boldsymbol{\theta}) = g(\boldsymbol{\theta}) f(\boldsymbol{x}) \exp\big(\boldsymbol{\phi}^T(\boldsymbol{\theta}) \boldsymbol{u}(\boldsymbol{x})\big), \tag{12.72}$$

where

$$g(\boldsymbol{\theta}) = \frac{1}{\int f(\boldsymbol{x}) \exp\big(\boldsymbol{\phi}^T(\boldsymbol{\theta}) \boldsymbol{u}(\boldsymbol{x})\big)\, d\boldsymbol{x}} \tag{12.73}$$

is the normalizing constant of the PDF. A similar definition holds if $\boldsymbol{x}$ is a discrete
random variable and the respective function represents the probability mass function
$P(\boldsymbol{x}|\boldsymbol{\theta})$; in this case, the integration in (12.73) becomes a summation. The
vector $\boldsymbol{\phi}(\boldsymbol{\theta})$ comprises the set of the so-called natural parameters,
and $f$, $\boldsymbol{u}$ are functions defining the distribution. It is readily seen from the
factorization theorem in Section 3.7 that $\boldsymbol{u}(\boldsymbol{x})$ is a sufficient statistic for
the parameter $\boldsymbol{\theta}$. Note that an attribute of the exponential family is that the number
of sufficient statistics, that is, the dimensionality of $\boldsymbol{u}$, is finite and remains
independent of the number of observations. If $\boldsymbol{\phi}(\boldsymbol{\theta}) = \boldsymbol{\theta}$,
then the exponential family is said to be in canonical form. A number of widely used distributions belong
to the exponential family, for example, the normal, exponential, gamma, chi-squared, beta, Dirichlet,
Bernoulli, binomial, and multinomial distributions. Examples of distributions that do not belong to this
family are the uniform with unknown bounds, Student's t, and most mixture distributions (for example,
[40,46]).

An advantage of the exponential family is that one can find conjugate priors for
$\boldsymbol{\theta}$; that is, priors that lead to posteriors, $p(\boldsymbol{\theta}|\mathcal{X})$, of
the same functional form as $p(\boldsymbol{\theta})$ (Section 3.11.1, Remarks 3.4).

Given (12.72), its conjugate prior is given by

$$p(\boldsymbol{\theta}; \lambda, \boldsymbol{v}) = h(\lambda, \boldsymbol{v}) \big(g(\boldsymbol{\theta})\big)^{\lambda} \exp\big(\boldsymbol{\phi}^T(\boldsymbol{\theta}) \boldsymbol{v}\big), \tag{12.74}$$

where $\lambda > 0$ and $\boldsymbol{v}$ are known as hyperparameters; that is, parameters that control
other parameters. The factor $h(\lambda, \boldsymbol{v})$ is an appropriate normalizing constant. It is easy
to see that defining the prior as in (12.74) and the likelihood function as in (12.72), the posterior
$p(\boldsymbol{\theta}|\boldsymbol{x})$ is of the same form as in (12.74).

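Before we give some examples, let us investigate a bit more the role played by $\lambda$ and
$\boldsymbol{v}$, as well as the presence of $g(\boldsymbol{\theta})$ and
$\boldsymbol{\phi}(\boldsymbol{\theta})$ in both (12.74) and (12.72). Assume that $\boldsymbol{x}$ and
$\boldsymbol{\theta}$ obey (12.72)-(12.74) and let
$\mathcal{X} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$ be a set of i.i.d. observations. Then

$$p(\mathcal{X}|\boldsymbol{\theta}) = \big(g(\boldsymbol{\theta})\big)^N \prod_{n=1}^{N} f(\boldsymbol{x}_n) \exp\left(\boldsymbol{\phi}^T(\boldsymbol{\theta}) \sum_{n=1}^{N} \boldsymbol{u}(\boldsymbol{x}_n)\right) \tag{12.75}$$

and

$$p(\boldsymbol{\theta}|\mathcal{X}) \propto p(\mathcal{X}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}) \propto \big(g(\boldsymbol{\theta})\big)^{\lambda + N} \exp\left(\boldsymbol{\phi}^T(\boldsymbol{\theta}) \left(\boldsymbol{v} + \sum_{n=1}^{N} \boldsymbol{u}(\boldsymbol{x}_n)\right)\right). \tag{12.76}$$

In other words, the posterior has hyperparameters equal to

$$\tilde{\lambda} = \lambda + N, \quad \tilde{\boldsymbol{v}} = \boldsymbol{v} + \sum_{n=1}^{N} \boldsymbol{u}(\boldsymbol{x}_n). \tag{12.77}$$

Interpreting (12.77), one can view $\lambda$ as being the effective number of observations that,
implicitly, the prior information contributes to the Bayesian learning process and $\boldsymbol{v}$ is
the total amount of information that these (implicit) $\lambda$ observations contribute to the sufficient
statistic. Their exact values, basically, quantify the amount of prior knowledge that the designer wants
to embed into the problem.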

Example 12.4. The Gaussian-gamma pair. Let our random variable x be a scalar and assume that

$$
p(x|\sigma^2) = \mathcal{N}(x|\mu, \sigma^2), \tag{12.78}
$$

where $\mu$ is known and $\sigma^2$ will be treated as an unknown random parameter. We will show that:

1. $p(x|\sigma^2)$ belongs to the exponential family.

It is algebraically more convenient to work with the precision $\beta = 1/\sigma^2$. Thus,

$$
p(x|\beta) = \frac{\beta^{1/2}}{\sqrt{2\pi}} \exp\left(-\frac{\beta}{2}(x-\mu)^2\right). \tag{12.79}
$$

Thus, $p(x|\beta)$ belongs to the exponential family with

$$
f(x) = \frac{1}{\sqrt{2\pi}}, \qquad \phi(\beta) = -\beta, \qquad u(x) = \frac{1}{2}(x-\mu)^2,
$$

and

$$
g(\beta) = \frac{1}{\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\beta(x-\mu)^2\right) dx} = \beta^{1/2}.
$$

2. The conjugate prior of (12.78) follows the gamma distribution.

The respective conjugate prior from (12.74) becomes

$$
p(\beta; \lambda, v) = h(\lambda, v)\, \beta^{\lambda/2} \exp(-\beta v). \tag{12.80}
$$

This has the form of

$$
\mathrm{Gamma}(\beta|a, b) = \frac{1}{\Gamma(a)}\, b^{a} \beta^{a-1} \exp(-b\beta), \tag{12.81}
$$

with parameters (Chapter 2) $a = \frac{\lambda}{2} + 1$ and $b = v$ and the normalizing constant $h(\lambda, v)$ being
necessarily equal to $b^a/\Gamma(a)$. The function $\Gamma(a)$ is defined as

$$
\Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x}\, dx.
$$

If we are given multiple observations $x_n$, $n = 1, 2, \ldots, N$, then the resulting posterior according to
(12.76) and (12.77) will be a gamma distribution with

$$
\tilde{b} = b + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b + \frac{N}{2} \hat{\sigma}^2_{ML},
$$

where $\hat{\sigma}^2_{ML}$ denotes the maximum likelihood estimate of the variance (Problem 3.22). Hence, the
physical meaning of $b$ is that it quantifies our prior estimate about the unknown variance. This also
ties nicely with what we have said in Section 3.7; $\hat{\sigma}^2_{ML}$ is a sufficient statistic for the variance if this
is the unknown parameter in a Gaussian. It can easily be shown that the conjugate prior with respect
to $\mu$, if $\sigma^2$ is known, is a Gaussian (Problem 12.10).
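The gamma update of this example can be sketched numerically. The following is a minimal illustration, not part of the original text; the function name and the sample values are hypothetical. It assumes the rate parameterization of (12.81) and uses the fact that, by (12.77), $\lambda$ becomes $\lambda + N$, so that $a = \lambda/2 + 1$ becomes $a + N/2$, while $b$ becomes $b + \frac{1}{2}\sum_n (x_n - \mu)^2$.

```python
import numpy as np

def gamma_posterior_for_precision(x, mu, a, b):
    """Update a Gamma(a, b) prior on the precision beta of N(x | mu, 1/beta), mu known.

    Hypothetical helper: a is the shape and b the rate, as in Eq. (12.81).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    a_post = a + 0.5 * n                       # lambda -> lambda + N, with a = lambda/2 + 1
    b_post = b + 0.5 * np.sum((x - mu) ** 2)   # b + (N/2) * (ML estimate of the variance)
    return a_post, b_post

# Illustrative data: four observations around a known mean mu = 2
x = np.array([1.0, 3.0, 2.0, 2.0])
a_post, b_post = gamma_posterior_for_precision(x, mu=2.0, a=1.0, b=1.0)
posterior_mean_precision = a_post / b_post     # mean of a Gamma(a, b) in the rate form
```

With more data, the term $\frac{N}{2}\hat{\sigma}^2_{ML}$ dominates $b$, so the influence of the prior choice fades, which agrees with the interpretation of the hyperparameters given after (12.77).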

In the case of a multivariate Gaussian of known mean $\boldsymbol{\mu}$ and unknown covariance matrix $\Sigma$
(precision matrix $Q = \Sigma^{-1}$), it can also be shown that it is of the exponential form and its conjugate
prior is given by the Wishart distribution,

$$
\mathcal{W}(Q|W, \nu) = h\, |Q|^{\frac{\nu - l - 1}{2}} \exp\left(-\frac{1}{2} \mathrm{trace}\{W^{-1} Q\}\right), \tag{12.82}
$$

where $h$ is the normalizing constant (Problem 12.11) and $W$ is an $l \times l$ matrix. The normalizing
constant is given by

$$
h = |W|^{-\frac{\nu}{2}} \left( 2^{\frac{\nu l}{2}}\, \pi^{\frac{l(l-1)}{4}} \prod_{i=1}^{l} \Gamma\left(\frac{\nu + 1 - i}{2}\right) \right)^{-1}, \tag{12.83}
$$

which admittedly is quite intimidating; however, in Bayesian learning we have the luxury of
bypassing the computation of the normalizing factor in (12.82). Once we express a PDF in terms of
$Q$ as in (12.82), then the normalizing constant has to be given by (12.83). The Wishart distribution
is a multivariate analogue of the gamma distribution.

Example 12.5. The Gaussian-Gaussian-gamma pair. We will now treat both $\mu$ and the precision $\beta$ as
unknown random parameters. We will show that:



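1. $p(x|\mu, \sigma^2) = \mathcal{N}(x|\mu, \sigma^2)$ is also of an exponential form. Indeed, for this case we have

$$
p(x|\mu, \sigma^2) = p(x|\mu, \beta) = \frac{\beta^{1/2} \exp\left(-\frac{\beta \mu^2}{2}\right)}{\sqrt{2\pi}} \exp\left( \left[-\frac{\beta}{2},\ \beta\mu\right] \begin{bmatrix} x^2 \\ x \end{bmatrix} \right).
$$

Hence,

$$
\boldsymbol{\theta} = [\beta, \mu]^T, \qquad \boldsymbol{\phi}(\boldsymbol{\theta}) = \begin{bmatrix} -\frac{\beta}{2} \\ \beta\mu \end{bmatrix}, \qquad \boldsymbol{u}(x) = \begin{bmatrix} x^2 \\ x \end{bmatrix},
$$

and performing the respective integration, we obtain

$$
f(x) = \frac{1}{\sqrt{2\pi}}, \qquad g(\boldsymbol{\theta}) = \beta^{1/2} \exp\left(-\frac{\beta \mu^2}{2}\right),
$$

which proves the claim.

2. The conjugate prior of $p(x|\mu, \sigma^2)$ is of a Gaussian-gamma form.

We have

$$
p(\mu, \beta; \lambda, \boldsymbol{v}) = h(\lambda, \boldsymbol{v})\, \beta^{\lambda/2} \exp\left(-\frac{\lambda \beta \mu^2}{2}\right) \exp\left( \left[-\frac{\beta}{2},\ \beta\mu\right] \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \right),
$$

which after some trivial algebra (Problem 12.12) gives

$$
p(\mu, \beta; \lambda, \boldsymbol{v}) = \mathcal{N}\left(\mu \,\Big|\, \frac{v_2}{\lambda}, (\lambda\beta)^{-1}\right) \mathrm{Gamma}\left(\beta \,\Big|\, \frac{\lambda + 1}{2}, \frac{v_1}{2} - \frac{v_2^2}{2\lambda}\right), \tag{12.84}
$$

which is known as the Gaussian-gamma distribution, with the Gaussian having mean value $\mu_0 = \frac{v_2}{\lambda}$
and variance $\sigma^2 = (\lambda\beta)^{-1}$, and the defining parameters of the gamma PDF being $a = \frac{\lambda+1}{2}$ and
$b = \frac{v_1}{2} - \frac{v_2^2}{2\lambda}$. Fig. 12.11 shows the contour plot of the Gaussian-gamma distribution of (12.84).

FIGURE 12.11

Contour plots of the Gaussian-gamma distribution with parameter values $\lambda = 2$, $v_1 = 4$, $v_2 = 0$.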

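For the more general case of a multivariate Gaussian, $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}, \Sigma)$, it turns out that it is also of an
exponential form and its conjugate prior is of the Gaussian-Wishart form (Problem 12.13), that is,

$$
p(\boldsymbol{\mu}, Q; \boldsymbol{\mu}_0, \lambda, W, \nu) = \mathcal{N}\big(\boldsymbol{\mu} \,|\, \boldsymbol{\mu}_0, (\lambda Q)^{-1}\big)\, \mathcal{W}(Q|W, \nu),
$$

where $Q = \Sigma^{-1}$.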

Example 12.6. We now turn our attention to discrete variables, and we will show that the multinomial
distribution is of an exponential form and that its conjugate prior is given by the Dirichlet distribution.

1. Let $z_1, z_2, \ldots, z_K$ be $K$ mutually exclusive and exhaustive events. Let $P_1, \ldots, P_K$ be the respective
probabilities, hence $\sum_{k=1}^{K} P_k = 1$. Let the experiment be repeated $N$ times. Then the probability
of the joint event, $z_1$ occurred $x_1$ times, $z_2$ occurred $x_2$ times, and so on, is given by the multinomial
distribution

$$
P(x_1, x_2, \ldots, x_K) = \binom{N}{x_1 \ldots x_K} \prod_{k=1}^{K} P_k^{x_k}, \tag{12.85}
$$

where

$$
\binom{N}{x_1 \ldots x_K} := \frac{N!}{x_1!\, x_2! \cdots x_K!}.
$$

Defining $\boldsymbol{P} = [P_1, \ldots, P_K]^T$, Eq. (12.85) can be rewritten as

$$
P(x_1, \ldots, x_K|\boldsymbol{P}) = \binom{N}{x_1 \ldots x_K} \prod_{k=1}^{K} \exp(x_k \ln P_k) = \binom{N}{x_1 \ldots x_K} \exp\left(\sum_{k=1}^{K} x_k \ln P_k\right). \tag{12.86}
$$

Thus, the multinomial is of an exponential form with

$$
\boldsymbol{\phi}(\boldsymbol{P}) = [\ln P_1, \ln P_2, \ldots, \ln P_K]^T, \qquad \boldsymbol{u}(\boldsymbol{x}) = [x_1, x_2, \ldots, x_K]^T,
$$

and because probabilities sum to one, we obtain

$$
g(\boldsymbol{P}) = 1, \qquad f(\boldsymbol{x}) = \binom{N}{x_1 \ldots x_K}.
$$

2. The conjugate prior of (12.86) can then be written as

$$
p(\boldsymbol{P}; \lambda, \boldsymbol{v}) = h(\lambda, \boldsymbol{v}) \exp\left(\sum_{k=1}^{K} v_k \ln P_k\right) \propto \prod_{k=1}^{K} P_k^{v_k}, \tag{12.87}
$$

which is a Dirichlet PDF. If we let $v_k := a_k - 1$, we bring (12.87) in the more standard formulation

$$
p(\boldsymbol{P}|\boldsymbol{a}) = \frac{\Gamma(\bar{a})}{\Gamma(a_1) \cdots \Gamma(a_K)} \prod_{k=1}^{K} P_k^{a_k - 1}, \qquad \sum_{k=1}^{K} P_k = 1, \tag{12.88}
$$

where the normalization constant (Chapter 2) has been plugged in, with

$$
\bar{a} := \sum_{k=1}^{K} a_k.
$$
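As a small numerical companion to this example, the posterior that results from combining the Dirichlet prior (12.88) with the multinomial likelihood (12.86) is again Dirichlet: multiplying the two gives $\prod_k P_k^{a_k + x_k - 1}$, so each $a_k$ is simply incremented by the count $x_k$. The sketch below is illustrative only; the function name and the numbers are assumptions, not taken from the text.

```python
import numpy as np

def dirichlet_posterior(a_prior, counts):
    """Return the Dirichlet parameters after observing multinomial counts.

    Hypothetical helper: the posterior is proportional to prod_k P_k^(a_k + x_k - 1).
    """
    a_prior = np.asarray(a_prior, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return a_prior + counts

# Illustrative case: K = 3 events, flat prior, N = 10 repetitions
a_post = dirichlet_posterior([1.0, 1.0, 1.0], [3, 0, 7])
posterior_mean = a_post / a_post.sum()   # mean of the Dirichlet, E[P_k] = a_k / sum_k a_k
```

Note that the event that was never observed still receives nonzero posterior mean probability, which is the smoothing effect of the prior pseudo-counts.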
12.8.1 THE EXPONENTIAL FAMILY AND THE MAXIMUM ENTROPY METHOD


Besides the computational advantages associated with the exponential family, there is another reason
that justifies its high popularity. Assume that we are given a set of observations, $x_n \in \mathcal{A}_x \subseteq \mathbb{R}$,
$n = 1, 2, \ldots, N$, drawn from a distribution whose functional form is unknown. Our goal is to estimate
the unknown PDF; however, we require that it will respect certain empirical expectations, which are
computed from the available observations, that is,

$$
\hat{\mu}_i := \frac{1}{N} \sum_{n=1}^{N} u_i(x_n), \quad i \in \mathcal{I}, \tag{12.89}
$$

where $\mathcal{I}$ is an index set and $u_i : \mathcal{A}_x \longmapsto \mathbb{R}$, $i \in \mathcal{I}$, are specific functions. For example, if $u_i(x) = x$,
then $\hat{\mu}_i$ is the sample mean. In such cases, it is not sensible to adopt a parametric functional form for
the PDF and try to optimize with respect to the unknown parameters, for example, via the maximum
likelihood method; in general, we cannot know if an adopted functional form can comply with the
available empirical expectations.

The maximum entropy (ME) method (sometimes called principle) offers a possible way to estimate
the unknown PDF, subject to the set of the available constraints [21]. According to this method, the
cost function to be maximized is the entropy (Section 2.5.2) associated with the PDF, that is,

$$
H := -\int_{\mathcal{A}_x} p(x) \ln p(x)\, dx. \tag{12.90}
$$

It is well known from Shannon's information theory that the entropy is a measure of uncertainty or
randomness. Maximization of the entropy with respect to $p(x)$ results in the most random PDF, subject
to the available constraints. Seen from another point of view, such a procedure guarantees that the
estimation of an unknown PDF is carried out by adopting the lowest number of assumptions, that is,
only the available set of constraints. For our case, the maximum entropy estimation method is cast as
follows:

$$
\begin{aligned}
\text{maximize with respect to } p(x): \quad & -\int_{\mathcal{A}_x} p(x) \ln p(x)\, dx, \\
\text{subject to} \quad & \mathbb{E}[u_i(x)] = \int_{\mathcal{A}_x} p(x)\, u_i(x)\, dx = \hat{\mu}_i, \quad i \in \mathcal{I}.
\end{aligned} \tag{12.91}
$$

In addition to the previous set of constraints, one has to consider the obvious one that guarantees that
$p(x)$ integrates to one, that is,

$$
\int_{\mathcal{A}_x} p(x)\, dx = 1.
$$

In the case of discrete variables, the involved integrations are replaced by summations. Solving the
optimization task in (12.91), it turns out that (Problem 12.15)

$$
\hat{p}(x) = C \exp\left(\sum_{i \in \mathcal{I}} \theta_i u_i(x)\right), \tag{12.92}
$$

that is, the ME estimate is of an exponential form. The parameters $\theta_i$, $i \in \mathcal{I}$, are the Lagrange
multipliers used in the optimization task and their values are determined via the constraints and are given in
terms of the available empirical expectations, $\hat{\mu}_i$, $i \in \mathcal{I}$. If no constraint is used other than the obvious
(normalizing) one and $\mathcal{A}_x = [a, b] \subset \mathbb{R}$, then the resulting PDF is the uniform distribution, $p(x) = C$;
indeed, this is the most random one, because it shows no preference for any specific interval of values.
If two constraints are used such that $u_1(x) = x$ and $u_2(x) = x^2$, the resulting PDF is the Gaussian one,
because the exponent is of a quadratic form (chapter's appendix). In other words, the Gaussian is the
most random PDF, subject to two constraints related to the mean and the variance. Note that although
we focused on real-valued random variables, everything is trivially extended to vector-valued ones.
An interesting discussion concerning the ME method and alternative views of the problem is provided
in [42].
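To make the role of the Lagrange multipliers concrete, the following sketch treats the discrete case with a single constraint, $u(x) = x$. The ME solution then has the exponential form $p_k \propto \exp(\theta x_k)$, and the multiplier $\theta$ is found numerically so that the mean constraint is met. This is only an illustration under stated assumptions: the support (the six faces of a die), the target mean, the bisection bounds, and all names are hypothetical choices.

```python
import numpy as np

def maxent_pmf(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """ME probability mass function on a finite support under one mean constraint.

    The solution is p_k proportional to exp(theta * values[k]); theta is the
    Lagrange multiplier, located here by bisection (an illustrative choice).
    """
    values = np.asarray(values, dtype=float)

    def pmf(theta):
        z = theta * values
        w = np.exp(z - z.max())        # shift the exponent for numerical stability
        return w / w.sum()

    def mean(theta):
        return float(pmf(theta) @ values)

    # The mean is increasing in theta (its derivative is a variance),
    # so bisection on theta is enough.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return theta, pmf(theta)

faces = np.arange(1, 7)
theta, p = maxent_pmf(faces, target_mean=4.5)     # constrained mean above 3.5
theta0, p0 = maxent_pmf(faces, target_mean=3.5)   # constraint met by the uniform pmf
```

When the target mean equals the mean of the uniform distribution, the multiplier is zero and the uniform solution is recovered, in line with the remark above about the unconstrained case.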


12.9 COMBINING LEARNING MODELS: A PROBABILISTIC POINT OF VIEW 

The idea of combining different learners to boost the overall performance by exploiting their individual 
characteristics was introduced in Section 7.9. We now return to this task via probabilistic arguments. 
The section is also useful from a pedagogical point of view, to familiarize the reader further with the 
use of the EM algorithm. 

Our starting consideration is that the data are distributed in different regions of the input space.
Thus, it seems reasonable to fit different learning models, one for each region. This idea reminds
us of the decision trees treated in Chapter 7. There, axis-aligned (linear) splits of the input space
were performed. Here, the input space will be split via hyperplanes (generalizations to more general
hypersurfaces are also possible) in a general position. Moreover, the main difference lies in the fact
that in CARTs, the splits were of the hard-type decision rule. In the current setting, we adopt a more
relaxed attitude and we are going to consider soft-type probabilistic splits, at the expense of some loss
in interpretability.

The basic concept of the combining scheme of this section is illustrated in Fig. 12.12. It is common
to refer to each one of the $K$ learners as an expert. At the heart of our modeling approach lie
the so-called gating functions, $g_k(\boldsymbol{x})$, $k = 1, 2, \ldots, K$, which control the importance of each expert
towards the final decision. These are optimally tuned during the training phase, together with the set of
parameters, $\boldsymbol{\theta}_k$, $k = 1, 2, \ldots, K$, which parameterize the experts, respectively. In the general case, the
gating functions are functions of the input variables. We refer to this type of modeling as mixture of
experts. In contrast, the special type of combination, where these are parameters and not functions, that
is, $g_k(\boldsymbol{x}) = g_k$, will be referred to as mixing of learners. We will focus on the latter case and present
the method in the context of the regression and classification tasks, using linear models.

12.9.1 MIXING LINEAR REGRESSION MODELS 

Our starting point is that each model is a linear regression model, $\boldsymbol{\theta}_k$, $k = 1, 2, \ldots, K$, where the
dimensionality of the input space has been assumed to increase by one to account for the intercepts and
the output variables are related to the input according to our familiar equation,

$$
y_k = \boldsymbol{\theta}_k^T \boldsymbol{x} + \eta, \tag{12.93}
$$

where $\eta$ is a white Gaussian noise source with variance $\sigma_\eta^2$ and it is assumed to be common for all
models; extension to more general cases can be obtained in a straightforward way. For generalized
linear models, $\boldsymbol{x}$ can simply be replaced by the nonlinear mapping $\boldsymbol{\phi}(\boldsymbol{x})$. We assume that the gating
parameters are interpreted as probabilities, and they will be denoted as $g_k = P_k$.

FIGURE 12.12

A block diagram of a mixture of experts. The output of each expert is weighted according to the outputs of the
gating network. In the general case, these weights are considered functions of the input.

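Under the previous assumptions, the following mixing model is adopted:

$$
p(y; \boldsymbol{\Xi}, \boldsymbol{P}) = \sum_{k=1}^{K} P_k\, \mathcal{N}(y \,|\, \boldsymbol{\theta}_k^T \boldsymbol{x}, \sigma_\eta^2), \tag{12.94}
$$

where

$$
\boldsymbol{\Xi} := [\boldsymbol{\theta}_1^T, \ldots, \boldsymbol{\theta}_K^T, \sigma_\eta^2]^T, \qquad \boldsymbol{P} := [P_1, \ldots, P_K]^T \tag{12.95}
$$
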

are the vectors of the unknown parameters, to be estimated during the training phase using the set of
training points, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$. Because each model is designed to be "in charge" of one
region in space, the corresponding parameters should be trained using input samples that originate
from the respective region; note, however, that the regions are not known and have to be learned during
training as well. This is in analogy with the task of Gaussian mixture modeling; recall that during
training, each observation was associated with a specific mixture component via the use of a hidden
variable. In the current setting, each input sample will be associated with a specific learner. Thus, our
current task is a close relative of the one treated in Section 12.6 and we could follow similar steps to
derive our results. However, for the sake of variety, a slightly different route will be taken. This will
also prove useful later on and at the same time fits slightly better with the jargon used for the current
formulation. Instead of the indices $k_n$, used in Gaussian mixture modeling, we will introduce a new
set of hidden variables, $z_{nk} \in \{0, 1\}$, $k = 1, 2, \ldots, K$, $n = 1, 2, \ldots, N$. If $z_{nk} = 1$, then sample $n$ is
processed by expert $k$. At the same time, for each $n$, $z_{nk}$ becomes equal to one only for a single value
of $k$ and zero for the rest. We are now ready to write down the likelihood for the complete training data
set,$^7$ $(y_n, z_{nk})$, i.e.,

$$
p(\boldsymbol{y}, Z; \boldsymbol{\Xi}, \boldsymbol{P}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \Big(P_k\, \mathcal{N}(y_n \,|\, \boldsymbol{\theta}_k^T \boldsymbol{x}_n, \sigma_\eta^2)\Big)^{z_{nk}}, \tag{12.96}
$$

where $\boldsymbol{y}$ is the vector of the output observations and $Z$ the matrix of the respective hidden variables.
The log-likelihood is readily obtained as

$$
\ln p(\boldsymbol{y}, Z; \boldsymbol{\Xi}, \boldsymbol{P}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln\Big(P_k\, \mathcal{N}(y_n \,|\, \boldsymbol{\theta}_k^T \boldsymbol{x}_n, \sigma_\eta^2)\Big). \tag{12.97}
$$

We can now state the steps for the EM algorithm. Starting from some initial conditions, $\boldsymbol{\Xi}^{(0)}$, $\boldsymbol{P}^{(0)}$, the
$(j + 1)$th iteration is given by:

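• E-Step:

$$
\begin{aligned}
Q(\boldsymbol{\Xi}, \boldsymbol{P}; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) &= \mathbb{E}_Z\big[\ln p(\boldsymbol{y}, Z; \boldsymbol{\Xi}, \boldsymbol{P})\big] \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \ln\Big(P_k\, \mathcal{N}(y_n \,|\, \boldsymbol{\theta}_k^T \boldsymbol{x}_n, \sigma_\eta^2)\Big).
\end{aligned}
$$

However,

$$
\mathbb{E}[z_{nk}] = P(k|y_n; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = \frac{P_k^{(j)}\, \mathcal{N}\big(y_n \,|\, \boldsymbol{\theta}_k^{(j)T} \boldsymbol{x}_n, \sigma_\eta^{2(j)}\big)}{\sum_{i=1}^{K} P_i^{(j)}\, \mathcal{N}\big(y_n \,|\, \boldsymbol{\theta}_i^{(j)T} \boldsymbol{x}_n, \sigma_\eta^{2(j)}\big)}, \tag{12.98}
$$

or

$$
Q(\boldsymbol{\Xi}, \boldsymbol{P}; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left(\ln P_k - \frac{1}{2} \ln \sigma_\eta^2 - \frac{1}{2\sigma_\eta^2} \big(y_n - \boldsymbol{\theta}_k^T \boldsymbol{x}_n\big)^2\right) + C, \tag{12.99}
$$

where $C$ is a constant not affecting the optimization, and

$$
\gamma_{nk} := P(k|y_n; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}).
$$
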

7 Strictly speaking, the data set depends also on $\boldsymbol{x}_n$; to simplify notation, we only give $y_n$, because this is the one that is treated
as a random variable, with the input variables being fixed.
As expected, (12.99) looks like (12.59).

• M-Step: This step comprises the computation of the unknown parameters via three different
optimization problems.

Gating parameters: Following similar steps as for (12.62), we obtain

$$
P_k^{(j+1)} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}. \tag{12.100}
$$


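Learners' parameters: For each $k = 1, 2, \ldots, K$, we have

$$
Q(\boldsymbol{\Xi}, \boldsymbol{P}; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = -\sum_{n=1}^{N} \frac{\gamma_{nk}}{2\sigma_\eta^2} \big(y_n - \boldsymbol{\theta}_k^T \boldsymbol{x}_n\big)^2 + C_1, \tag{12.101}
$$

where $C_1$ includes all terms that do not depend on $\boldsymbol{\theta}_k$. Taking the gradient and equating to zero, we
readily obtain

$$
\sum_{n=1}^{N} \gamma_{nk}\, \boldsymbol{x}_n \big(y_n - \boldsymbol{x}_n^T \boldsymbol{\theta}_k\big) = \boldsymbol{0},
$$

or, employing the input data matrix $X^T := [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N]$,

$$
X^T \Gamma_k (\boldsymbol{y} - X \boldsymbol{\theta}_k) = \boldsymbol{0},
$$

with

$$
\Gamma_k := \mathrm{diag}\{\gamma_{1k}, \ldots, \gamma_{Nk}\},
$$

and finally

$$
\boldsymbol{\theta}_k^{(j+1)} = \big(X^T \Gamma_k X\big)^{-1} X^T \Gamma_k \boldsymbol{y}, \quad k = 1, 2, \ldots, K. \tag{12.102}
$$

Eq. (12.102) is the solution to a weighted LS problem, similar in form to the one met in Section 7.6,
while dealing with the logistic regression. Note that the weighting matrix involves the posterior
probabilities associated with the $k$th expert.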

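Noise variance: We have

$$
Q(\sigma_\eta^2; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left(-\frac{1}{2} \ln \sigma_\eta^2 - \frac{1}{2\sigma_\eta^2} \big(y_n - \boldsymbol{\theta}_k^{(j+1)T} \boldsymbol{x}_n\big)^2\right) + C_2, \tag{12.103}
$$

whose optimization with respect to $\sigma_\eta^2$ leads to

$$
\sigma_\eta^{2(j+1)} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \big(y_n - \boldsymbol{\theta}_k^{(j+1)T} \boldsymbol{x}_n\big)^2, \tag{12.104}
$$

where use has been made of the fact that $\sum_{k=1}^{K} \gamma_{nk} = 1$ for every $n$.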
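The E- and M-steps above can be put together in a short sketch. It is an illustrative implementation under stated assumptions, not the author's code: the function names, the random initialization, the fixed number of iterations, and the synthetic data are all hypothetical choices. The responsibilities are computed in the log domain for numerical safety, and the weighted LS solution (12.102) is obtained with a least-squares solver applied to the rows scaled by $\sqrt{\gamma_{nk}}$, which gives the same minimizer as the closed form.

```python
import numpy as np

def log_gauss(y, mean, var):
    """Elementwise ln N(y | mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def em_mixing_linear_regression(X, y, K, n_iter=50, seed=0):
    """EM for p(y) = sum_k P_k N(y | theta_k^T x, sigma2), following (12.98)-(12.104).

    X is N x l and already contains the column of ones for the intercept.
    """
    rng = np.random.default_rng(seed)          # random start: an illustrative choice
    N, l = X.shape
    Theta = rng.normal(size=(K, l))            # row k holds theta_k
    P = np.full(K, 1.0 / K)
    sigma2 = float(np.var(y))
    loglik = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma_nk of (12.98), via log-sum-exp
        log_joint = np.log(P) + log_gauss(y[:, None], X @ Theta.T, sigma2)   # N x K
        m = log_joint.max(axis=1, keepdims=True)
        log_evidence = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
        gamma = np.exp(log_joint - log_evidence[:, None])
        loglik.append(float(log_evidence.sum()))   # log-likelihood before this update
        # M-step, gating parameters (12.100)
        P = gamma.mean(axis=0)
        # M-step, experts (12.102): weighted LS through sqrt-weighted rows
        for k in range(K):
            w = np.sqrt(gamma[:, k])
            Theta[k] = np.linalg.lstsq(w[:, None] * X, w * y, rcond=None)[0]
        # M-step, noise variance (12.104)
        resid2 = (y[:, None] - X @ Theta.T) ** 2
        sigma2 = float((gamma * resid2).sum() / N)
    return Theta, P, sigma2, gamma, np.array(loglik)

# Synthetic data: one line governs |x| <= 0.5, another one the rest of [-1, 1]
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 200)
X = np.column_stack([x, np.ones_like(x)])
inner = np.abs(x) <= 0.5
y = np.where(inner, 2.0 * x - 1.0, -1.5 * x + 1.0) + rng.normal(scale=0.1, size=x.size)
Theta, P, sigma2, gamma, ll = em_mixing_linear_regression(X, y, K=2)
```

Tracking the log-likelihood across iterations is a cheap sanity check, since EM should never decrease it. As with Gaussian mixtures, the solution that is reached depends on the initialization, so several restarts may be needed in practice.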


Mixture of Experts 

In mixture of experts [24], the gating parameters are expressed in a parametric form, as functions of
the input variables $\boldsymbol{x}$. A common choice is to assume that

$$
g_k(\boldsymbol{x}) := P_k(\boldsymbol{x}) = \frac{\exp\big(\boldsymbol{w}_k^T \boldsymbol{x}\big)}{\sum_{i=1}^{K} \exp\big(\boldsymbol{w}_i^T \boldsymbol{x}\big)}. \tag{12.105}
$$

Referring to Fig. 12.12, the gating weights are the outputs of the gating network, which is also excited
by the same inputs as the experts. In the neural networks context, as we will see in Chapter 18, we
can consider the gating network as a neural network, with activation function given by (12.105), which
is known as the softmax activation [5]. Note that (12.105) is of exactly the same form as (7.47), used
in the multiclass logistic regression. Under such a setting, $P_k$ in (12.99) is replaced by $P_k(\boldsymbol{x})$, and the
respective M-step becomes equivalent to optimizing with respect to $\boldsymbol{w}_k$, $k = 1, 2, \ldots, K$. We have

$$
Q(\boldsymbol{\Xi}, \boldsymbol{P}; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \ln P_k(\boldsymbol{x}_n) + C_3. \tag{12.106}
$$

Observe that (12.106) is of the same form as (7.49), used for the multiclass logistic regression, and
optimization follows similar steps (see also, for example, [19]).

A mixture of experts has been used in a number of applications with a typical one being that of 
inverse problems, where from the output, one has to deduce the input. However, in many cases, this 
is a one-to-many task and the mixture of experts is useful to model the choice among these “many” 
options. For example, in [4], mixture of experts is used for tracking people in video recordings, where 
the mapping from the image to pose is not unique, due to occlusion. 
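The softmax gate (12.105) and the gating part of the M-step objective (12.106) can be sketched as follows. This is a minimal illustration; the function names, array shapes, and random test values are assumptions. The gradient with respect to $\boldsymbol{w}_k$ is $\sum_n (\gamma_{nk} - P_k(\boldsymbol{x}_n))\boldsymbol{x}_n$, which holds because the responsibilities of each sample sum to one; it could be fed to any gradient-based optimizer.

```python
import numpy as np

def gating_probabilities(W, X):
    """Softmax gate of (12.105): row n holds P_1(x_n), ..., P_K(x_n).

    W is K x l (row k is w_k), X is N x l.
    """
    scores = X @ W.T                                        # entries w_k^T x_n
    scores = scores - scores.max(axis=1, keepdims=True)     # leaves the ratios unchanged
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def gating_objective(W, X, gamma):
    """Gating part of Q in (12.106), without the constant C_3."""
    return float(np.sum(gamma * np.log(gating_probabilities(W, X))))

def gating_gradient(W, X, gamma):
    """Gradient of gating_objective; row k is sum_n (gamma_nk - P_k(x_n)) x_n.

    Assumes every row of gamma sums to one.
    """
    return (gamma - gating_probabilities(W, X)).T @ X

# Illustrative values: N = 5 samples, l = 2 (input plus intercept), K = 3 experts
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=5), np.ones(5)])
W = rng.normal(size=(3, 2))
gamma = rng.dirichlet(np.ones(3), size=5)     # stand-in responsibilities
G = gating_probabilities(W, X)
grad = gating_gradient(W, X, gamma)
```

Setting all $\boldsymbol{w}_k$ equal makes the gate input-independent and uniform, which connects this construction to the mixing-of-learners case with constant $P_k$.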


Hierarchical Mixture of Experts 

A direct generalization of the mixture of experts concept is to add more levels of gating functions in
a hierarchical fashion, giving rise to what is known as a hierarchical mixture of experts (HME). The
idea is illustrated in the block diagram of Fig. 12.13. This architecture resembles that of trees, having
the experts as leaves, the gating networks as nonterminal nodes, and the output (summing) node as
the root one. A hierarchical mixture of experts divides the space into a nested set of regions, with
the information combined among the experts under the control of the hierarchically placed gating
networks. This hierarchy conforms with the more general idea of divide and conquer strategies.

Compared to decision trees, an HME evolves around soft decision rules, in contrast to the hard ones
that are employed in CARTs. A hard decision, usually, leads to a loss of information. Once a decision
is taken, it cannot change later on. In contrast, soft decision rules provide the luxury to the network to
preserve information until a final decision is taken. For example, according to a hard decision rule, if a
sample is located close to a decision surface, it will be labeled according to the label on which side it
lies. However, in a soft decision rule, the information, related to the position of the point with respect
to the decision surface, will be retained until the stage at which the final decision must be made, by
taking into consideration more information that becomes available as the processing develops.

FIGURE 12.13

A block diagram of a hierarchical mixture of experts with two levels of hierarchy.

Note that training a mixture of experts can also be carried out via a different path, by optimizing a 
cost function without it being necessary to employ probabilistic arguments (see, for example, [19]). 

12.9.2 MIXING LOGISTIC REGRESSION MODELS 

Following Section 12.9.1, the combination rationale can also be applied to classification tasks. To this
end, we employ the two-class logistic regression model for each one of the experts, and the combination
rule, given the input value $\boldsymbol{x}$, is now written as

$$
P(y; \boldsymbol{\Xi}, \boldsymbol{P}) = \sum_{k=1}^{K} P_k\, s_k^{y} (1 - s_k)^{1-y}, \tag{12.107}
$$

where the definition of logistic regression from Section 7.6 has been used, the values of the label
$y \in \{0, 1\}$ correspond to the two classes $\omega_1$ and $\omega_2$, respectively, and

$$
s_k := \sigma(\boldsymbol{\theta}_k^T \boldsymbol{x}) \tag{12.108}
$$

denotes the output of the $k$th expert. As in Section 12.9.1, $\boldsymbol{\Xi}$ is the set of the unknown parameters and
$\boldsymbol{P}$ the corresponding set of the gating network. Mobilizing similar arguments as for the case of linear
regression, we can easily state that the likelihood of the complete data set is given by

$$
P(\boldsymbol{y}, Z; \boldsymbol{\Xi}, \boldsymbol{P}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \Big(P_k\, s_{nk}^{y_n} (1 - s_{nk})^{1 - y_n}\Big)^{z_{nk}}, \tag{12.109}
$$


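where $s_{nk} := \sigma(\boldsymbol{\theta}_k^T \boldsymbol{x}_n)$ and $\boldsymbol{y}$ is the set of labels, $y_n$, $n = 1, 2, \ldots, N$, of the training samples. Following
the standard arguments of the EM algorithm applied on the respective log-likelihood function, it is
readily shown that the E-step at the $j$th iteration is given by

$$
Q(\boldsymbol{\Xi}, \boldsymbol{P}; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \Big(\ln P_k + y_n \ln s_{nk} + (1 - y_n) \ln(1 - s_{nk})\Big), \tag{12.110}
$$

where

$$
\gamma_{nk} = \mathbb{E}[z_{nk}] = P(k|y_n; \boldsymbol{\Xi}^{(j)}, \boldsymbol{P}^{(j)}) = \frac{P_k^{(j)}\, s_{nk}^{y_n} (1 - s_{nk})^{1 - y_n}}{\sum_{i=1}^{K} P_i^{(j)}\, s_{ni}^{y_n} (1 - s_{ni})^{1 - y_n}}. \tag{12.111}
$$

Note that in (12.111), the notation $s_{nk}^{(j)}$, $s_{ni}^{(j)}$ should have been used, but we tried to unclutter it
slightly.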

In the M-step, maximization with respect to $\boldsymbol{P}$ is of the same form as it was for the regression task,
and it leads to

$$
P_k^{(j+1)} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}. \tag{12.112}
$$

To obtain the parameters for the experts, one has to resort to an iterative scheme. Observe that the only
differences of (12.110) with (7.38) are (a) the presence of the term involving $P_k$, (b) the summation
over $k$, and (c) the existence of the multiplicative factors $\gamma_{nk}$. The first two make no difference in the
optimization with respect to a single $\boldsymbol{\theta}_k$ and the latter is just a constant. Hence the optimization is
similar to the one used for the two-class logistic regression in Section 7.6, with the gradient and the
Hessian matrices being the same except for the multiplicative factors (and the sign, because there, the
negative log-likelihood was considered). The extension to the multiclass case is straightforward and
follows similar steps.
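A possible sketch of one EM pass for mixing logistic regression models is given below. It is meant only as an illustration under stated assumptions: all names and the synthetic data are hypothetical, and a single Newton step stands in for the iterative scheme mentioned above. Differentiating the $\boldsymbol{\theta}_k$-dependent terms of (12.110) gives the gradient $\sum_n \gamma_{nk}(y_n - s_{nk})\boldsymbol{x}_n$ and the Hessian $-\sum_n \gamma_{nk}\, s_{nk}(1 - s_{nk})\boldsymbol{x}_n \boldsymbol{x}_n^T$, that is, the logistic regression expressions weighted by $\gamma_{nk}$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def responsibilities(P, Theta, X, y):
    """gamma_nk of (12.111). Theta is K x l, X is N x l, y holds 0/1 labels."""
    S = sigmoid(X @ Theta.T)                         # N x K, entries s_nk
    lik = np.where(y[:, None] == 1, S, 1.0 - S)      # s_nk^y (1 - s_nk)^(1 - y)
    joint = P * lik
    return joint / joint.sum(axis=1, keepdims=True)

def expert_objective(theta_k, gamma_k, X, y):
    """Terms of Q in (12.110) that depend on theta_k."""
    s = sigmoid(X @ theta_k)
    return float(np.sum(gamma_k * (y * np.log(s) + (1 - y) * np.log(1 - s))))

def expert_grad_hess(theta_k, gamma_k, X, y):
    """Gradient and Hessian of expert_objective with respect to theta_k."""
    s = sigmoid(X @ theta_k)
    grad = X.T @ (gamma_k * (y - s))
    hess = -(X * (gamma_k * s * (1 - s))[:, None]).T @ X
    return grad, hess

# Illustrative data: scalar input plus intercept, noisy threshold labels
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=40), np.ones(40)])
y = (X[:, 0] + 0.3 * rng.normal(size=40) > 0).astype(float)
P = np.array([0.5, 0.5])
Theta = rng.normal(size=(2, 2))

gamma = responsibilities(P, Theta, X, y)             # E-step
P_new = gamma.mean(axis=0)                           # gating update (12.112)
grad, hess = expert_grad_hess(Theta[0], gamma[:, 0], X, y)
theta_new = Theta[0] - np.linalg.solve(hess, grad)   # one Newton step for expert 1
```

In a full implementation, the Newton step would be repeated (possibly with a step-size safeguard) for every expert before returning to the E-step.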


Example 12.7. This example demonstrates the application of a mixture of two linear regression models
to a synthetic data set. The input and the output are scalars, $x_n$ and $y_n$. Fig. 12.14A shows the setup.
The data reside in different parts of the input space and in each region the input-output relation is
of a different form. The goal is to estimate the two linear functions, $\theta_{1,k} x + \theta_{0,k}$, $k \in \{1, 2\}$. The EM
algorithm of Section 12.9.1 was initialized with the true value of the noise variance $\sigma_\eta^2$.

Fig. 12.14A-C shows the resulting linear models after the first, the seventh, and finally the 15th
iteration. Fig. 12.14D-F shows the resulting posteriors $P(k|y_n, x_n)$ (measured by the length of the
bar) associated with each learner as a function of $x_n$. After convergence, they are of a bimodal nature,
depending on where each sample resides in the input space. In this way, significant probability mass
is assigned even to regions where data points do not exist. A smoother and more accurate, from a
generalization point of view, estimate results if we let the gating parameters be functions of the input
variables themselves.

FIGURE 12.14

The two fitted lines as estimated by the EM algorithm after (A) the first, (B) the seventh, and (C) the 15th iterations.
Figures (D)-(F) show the corresponding posterior probabilities, for each one of the training points $x_n$. The length of
each segment is equal to the value of the respective probability.


PROBLEMS 

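12.1 Show that if

$$
p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{z}|\boldsymbol{\mu}_z, \Sigma_z)
$$

and

$$
p(\boldsymbol{t}|\boldsymbol{z}) = \mathcal{N}(\boldsymbol{t}|A\boldsymbol{z}, \Sigma_{t|z}),
$$

then

$$
\mathbb{E}[\boldsymbol{z}|\boldsymbol{t}] = \big(\Sigma_z^{-1} + A^T \Sigma_{t|z}^{-1} A\big)^{-1} \big(A^T \Sigma_{t|z}^{-1} \boldsymbol{t} + \Sigma_z^{-1} \boldsymbol{\mu}_z\big).
$$

12.2 Let $\boldsymbol{x} \in \mathbb{R}^l$ be a random vector following the normal $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}, \Sigma)$. Consider $\boldsymbol{x}_n$, $n = 1, 2, \ldots, N$,
to be i.i.d. observations. If the prior for $\boldsymbol{\mu}$ follows $\mathcal{N}(\boldsymbol{\mu}|\boldsymbol{\mu}_0, \Sigma_0)$, show that the posterior
$p(\boldsymbol{\mu}|\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N)$ is normal $\mathcal{N}(\boldsymbol{\mu}|\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$ with

$$
\tilde{\Sigma}^{-1} = \Sigma_0^{-1} + N \Sigma^{-1}
$$

and

$$
\tilde{\boldsymbol{\mu}} = \tilde{\Sigma} \big(\Sigma_0^{-1} \boldsymbol{\mu}_0 + N \Sigma^{-1} \bar{\boldsymbol{x}}\big),
$$

where $\bar{\boldsymbol{x}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n$.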

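12.3 If $\mathcal{X}$ is the set of observed variables and $\mathcal{X}^l$ the set of the corresponding latent ones, show that

$$
\frac{\partial \ln p(\mathcal{X}; \boldsymbol{\xi})}{\partial \boldsymbol{\xi}} = \mathbb{E}\left[\frac{\partial \ln p(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})}{\partial \boldsymbol{\xi}}\right],
$$

where $\mathbb{E}[\cdot]$ is with respect to $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$ and $\boldsymbol{\xi}$ is an unknown vector parameter. Note that if
one fixes the value of $\boldsymbol{\xi}$ in $p(\mathcal{X}^l|\mathcal{X}; \boldsymbol{\xi})$, then one has obtained the M-step of the EM algorithm.

12.4 Show Eq. (12.42).

12.5 Let $\boldsymbol{y} \in \mathbb{R}^N$, let $\boldsymbol{\theta} \in \mathbb{R}^l$, and let $\Phi$ be a matrix of appropriate dimensions. Derive the expected
value of $\|\boldsymbol{y} - \Phi\boldsymbol{\theta}\|^2$ with respect to $\boldsymbol{\theta}$, given $\mathbb{E}[\boldsymbol{\theta}]$ and the corresponding covariance matrix $\Sigma_\theta$.

12.6 Derive recursions (12.60)-(12.62).

12.7 Show that the Kullback-Leibler divergence, KL$(p\,\|\,q)$, is a nonnegative quantity.

Hint: Recall that $\ln(\cdot)$ is a concave function and use Jensen's inequality, i.e.,

$$
f\left(\int g(\boldsymbol{x}) p(\boldsymbol{x})\, d\boldsymbol{x}\right) \le \int f\big(g(\boldsymbol{x})\big) p(\boldsymbol{x})\, d\boldsymbol{x},
$$

where $p(\boldsymbol{x})$ is a PDF and $f$ is a convex function.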

12.8 Prove that the binomial and beta distributions are conjugate pairs with respect to the mean 
value. 

12.9 Show that the normalizing constant $C$ in the Dirichlet PDF

$$
\mathrm{Dir}(\boldsymbol{x}|\boldsymbol{a}) = C \prod_{k=1}^{K} x_k^{a_k - 1}, \qquad \sum_{k=1}^{K} x_k = 1,
$$

is given by

$$
C = \frac{\Gamma(a_1 + a_2 + \cdots + a_K)}{\Gamma(a_1)\Gamma(a_2) \cdots \Gamma(a_K)}.
$$

Hint: Use the property $\Gamma(a + 1) = a\Gamma(a)$.

(a) Use induction. Because the proposition is true for $k = 2$ (beta distribution), assume that it
is true for $k = K - 1$, and prove that it will be true for $k = K$.
(b) Note that due to the constraint $\sum_{k=1}^{K} x_k = 1$, only $K - 1$ of the variables are independent.
So, basically, the Dirichlet PDF implies that

$$
p(x_1, x_2, \ldots, x_{K-1}) = C \prod_{k=1}^{K-1} x_k^{a_k - 1} \left(1 - \sum_{k=1}^{K-1} x_k\right)^{a_K - 1}.
$$


12.10 Show that $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}, \Sigma)$ for known $\Sigma$ is of an exponential form and that its conjugate prior is
also Gaussian.

12.11 Show that the conjugate prior of the multivariate Gaussian with respect to the precision matrix
$Q$ is a Wishart distribution.

12.12 Show that the conjugate prior of the univariate Gaussian $\mathcal{N}(x|\mu, \sigma^2)$ with respect to the mean
and the precision $\beta = \frac{1}{\sigma^2}$, is the Gaussian-gamma product

$$
p(\mu, \beta; \lambda, \boldsymbol{v}) = \mathcal{N}\left(\mu \,\Big|\, \frac{v_2}{\lambda}, (\lambda\beta)^{-1}\right) \mathrm{Gamma}\left(\beta \,\Big|\, \frac{\lambda + 1}{2}, \frac{v_1}{2} - \frac{v_2^2}{2\lambda}\right),
$$

where $\boldsymbol{v} := [v_1, v_2]^T$.

12.13 Show that the multivariate Gaussian $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}, Q^{-1})$ has as a conjugate prior, with respect to the
mean and the precision matrix $Q$, the Gaussian-Wishart product.

12.14 Show that the distribution

$$
P(x|\mu) = \mu^{x} (1 - \mu)^{1-x}, \quad x \in \{0, 1\},
$$

is of an exponential form and derive its conjugate prior with respect to $\mu$.

12.15 Show that estimating an unknown PDF by maximizing the respective entropy, subject to a set 
of empirical expectations, results in a PDF that belongs to the exponential family. 


MATLAB® EXERCISES 

12.16 Sample $N = 20$ equally spaced points $x_n$ in the interval [0, 2]. Create the output samples $y_n$
according to the nonlinear model of Example 12.1, where the noise variance is set equal to
$\sigma_\eta^2 = 0.05$.

(a) Let the parameters of the Gaussian prior be $\boldsymbol{\theta}_0 = [0.2, -1, 0.9, 0.7, -0.2]^T$ and $\Sigma_\theta = 0.1I$.
Compute the covariance matrix and the mean of the posterior Gaussian distribution
using Eq. (12.19) and Eq. (12.20), respectively. Then, select randomly $K = 20$ points $x_k$
in the interval [0, 2]. Compute the predictions for the mean values $\mu_y$ and the associated
variances $\sigma_y^2$, utilizing Eq. (12.22) and Eq. (12.23), respectively. Plot the graph of the true
function together with the predicted mean values $\mu_y$, and use MATLAB®'s "errorbar"
function to show the confidence intervals on these predictions.

Repeat the experiment using $N = 500$ points, and try different values of $\sigma_\eta^2$ to notice the
change in the estimated confidence intervals.

(b) Repeat the previous experiment using a randomly chosen value for $\boldsymbol{\theta}_0$ and different values
on the parameters, for example, $\sigma_\eta^2 = 0.05$, $\sigma_\eta^2 = 0.15$, $\sigma_\theta^2 = 0.1$, or $\sigma_\theta^2 = 2$, and $N = 500$
or $N = 20$.

(c) Repeat the experiment once more using a wrong order for the model, e.g., second- or
third-order polynomial. Use different values for the parameters than the corresponding
correct ones for the initialization. See also Example 12.1.

12.17 Consider Example 12.1 as before. Sample $N = 500$ equally spaced points $x_n$ in the interval
[0, 2]. Create the output samples $y_n$ according to the nonlinear model of the example, where
the noise variance is set equal to $\sigma_\eta^2 = 0.05$. Implement the linear regression EM algorithm of
Section 12.5. Assume the correct number of parameters. Then repeat Example 12.2. After the
convergence of the EM, sample ten points, $x_k$, randomly, in the same interval as before and
compute the predictive means $\mu_y$ and variances $\sigma_y^2$. Plot the true signal curve, the predictive
means $\mu_y$, and the respective confidence intervals using MATLAB®'s "errorbar" function.
Repeat the EM run, using different initial values and an incorrect number of parameters. Comment
on the results.

12.18 Generate 100 data points from each of the three two-dimensional Gaussian distributions of
Example 12.3. Plot the data points along with the confidence ellipsoids for each Gaussian
with coverage probability 80%. Implement the Gaussian mixture model via the EM algorithm,
whose steps are described in Eqs. (12.58)-(12.62). Moreover, compute the log-likelihood
function in every iteration of the EM algorithm using Eq. (12.55).

(a) In separate figures (always containing the data), plot the ellipsoids of the Gaussian
distributions estimated by the EM algorithm during iterations $j = 1$, $j = 5$, and $j = 30$, and the
log-likelihood function versus the number of iterations.

(b) Repeat the experiment after bringing the cluster means closer together. Compare the
results.

12.19 Generate 100 data points from each of the two-dimensional Gaussian distributions with
parameters

$$
\boldsymbol{\mu}_1 = [0.9, 1.02]^T, \qquad \boldsymbol{\mu}_2 = [-1.2, -1.3]^T
$$

and

$$
\Sigma_1 = \begin{bmatrix} 0.5 & 0.081 \\ 0.081 & 0.7 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 0.4 & 0.02 \\ 0.02 & 0.3 \end{bmatrix}.
$$

Plot the data points using different colors for the two Gaussian distributions. Implement the
k-means algorithm presented in Algorithm 12.1.

(a) Run the k-means algorithm for $K = 2$ and plot the results. Run, also, the Gaussian mixtures
EM of the previous exercise and plot the 80% probability confidence ellipsoids to
compare the results.

(b) Now, sample $N_1 = 100$ and $N_2 = 20$ points from each distribution and repeat the experiment
to reproduce the results of Fig. 12.9.

(c) Try different configurations and play with $K$ different than the true number of clusters.
Comment on the results.

(d) Play with different initialization points and also try points which are too far from the true
mean values of the clusters. Comment on the results.

12.20 Generate 50 equidistant input data points in the interval [-1, 1]. Assume two linear regression
models, the first with scale 0.005 and intercept -1 and the second with scale 0.018 and intercept
1. Generate observations from these two models by using the first model for the input points
in the interval [-0.5, 0.5], and the second model for the inputs in the interval [-1, -0.5] ∪
[0.5, 1]. Also, add Gaussian noise of zero mean and variance 0.01. Next, implement the EM
algorithm developed in Section 12.9.1. Initialize the noise precision $\beta$ to its true value. For
iterations 1, 5, and 30, plot the data points and the estimated linear functions $\theta_{1,k} x + \theta_{0,k}$,
$k \in \{1, 2\}$, of the models, to reproduce the results of Fig. 12.14.
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13.1 INTRODUCTION 

This chapter is the second one dedicated to Bayesian learning. The emphasis here, compared to
Chapter 12, is on more advanced topics, dealing with approximate inference methods. Such methods are
employed when the involved integrations are no longer computationally tractable. Two paths for
approximate inference, known as variational techniques, are discussed. One is based on the mean field
approximation and the lower bound interpretation of the EM, and the other on convex duality and
variational bounds. Regression and mixture modeling are discussed in this framework. Emphasis is
given to sparse Bayesian modeling techniques and hierarchical Bayesian models. The relevance vector
machine framework is presented. Expectation propagation is also discussed as an alternative to
variational methods for approximate inference. At the end of the chapter, Bayesian learning in the context
of nonparametric models is presented, including Dirichlet processes (DPs), the Chinese restaurant
process (CRP), the Indian buffet process (IBP), and Gaussian processes. Finally, a case study concerning
hyperspectral imaging is presented.


13.2 VARIATIONAL APPROXIMATION IN BAYESIAN LEARNING 

Recall that in order to apply the EM algorithm, the functional form of the posterior of the
latent/hidden variables, given the observations, must be known. However, analytic computations related to the
posterior are not always tractable. In such cases, the EM algorithm, in its standard form as discussed in
the previous chapter, is not applicable. In this section, we will describe an alternative path that builds
upon the EM interpretation given in Section 12.7.

Once more, we will adopt a general notation, which can then be adapted to the needs of specific
problems. Let $\mathcal{X} = \{x_1, \ldots, x_N\}$ be the set of observed variables and
$\mathcal{X}^l = \{x_1^l, \ldots, x_N^l\}$ the set of the
N corresponding (local) latent ones, as they have been defined in Section 12.4. Furthermore, in the
current section, in addition to the latent variables, we will explicitly bring into the game a set of K
(global) hidden random parameters, $\boldsymbol{\theta} \in \mathbb{R}^K$, whose number is fixed and which will be accompanied
by a prior PDF. The complete likelihood function is now written $p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi})$, where $\boldsymbol{\xi}$ is the set of
unknown nonrandom (hyper)parameters that have to be estimated. As is always the case in Bayesian
learning, the goal is to infer the posterior probability distributions that describe the latent as well as the
hidden random variables.

The functional in Eq. (12.63) is now redefined as

\[ \mathcal{F}(q, \boldsymbol{\xi}) = \int\!\!\int q(\mathcal{X}^l, \boldsymbol{\theta})
   \ln \frac{p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi})}{q(\mathcal{X}^l, \boldsymbol{\theta})}
   \, d\mathcal{X}^l \, d\boldsymbol{\theta}. \tag{13.1} \]

The counterpart of Eq. (12.66) becomes

\[ p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi}) =
   p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X}; \boldsymbol{\xi}) \, p(\mathcal{X}; \boldsymbol{\xi}), \tag{13.2} \]

which leads to

\[ \ln p(\mathcal{X}; \boldsymbol{\xi}) = \mathcal{F}(q, \boldsymbol{\xi}) +
   \mathrm{KL}\big( q(\mathcal{X}^l, \boldsymbol{\theta}) \,\|\, p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X}; \boldsymbol{\xi}) \big). \]

The difference with Eq. (12.67) lies in the fact that in our current setting the posterior,
$p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X}; \boldsymbol{\xi})$,
is not assumed to be known; thus, setting the Kullback-Leibler (KL) divergence,
$\mathrm{KL}(q(\mathcal{X}^l, \boldsymbol{\theta}) \| p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X}; \boldsymbol{\xi}))$,
to zero, in order to maximize the value of the functional, is no longer possible. The only
alternative is to resort to computationally tractable approximations concerning the functional form
of q, which will allow the optimization of the functional with respect to q (and with respect to the
nonrandom parameters, $\boldsymbol{\xi}$).

Optimizing a functional with respect to a function is known in mathematics as calculus of
variations. The simplest example of this problem is to compute the geodesic that connects two points on
a surface. Two names whose contributions are considered significant breakthroughs that consolidated
this field are the Swiss German mathematician Leonhard Euler (1707-1783) and the Italian-born
mathematician and astronomer Joseph-Louis Lagrange (1736-1813). It is interesting to note that Lagrange
succeeded Euler as director of mathematics in the Prussian Academy of Sciences in Berlin.

In order to deal with the current problem, we will constrain $q(\mathcal{X}^l, \boldsymbol{\theta})$ to lie within a specific
family of functions. Note that in this case, if the unknown $p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X}; \boldsymbol{\xi})$ does not belong to the selected
family of functions, the KL divergence cannot become zero, and the lower bound, $\mathcal{F}(q, \boldsymbol{\xi})$, of the
marginal log-likelihood cannot be made tight. This is the reason the method is called a variational
approximation.
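The decomposition of the log-evidence into lower bound plus KL divergence can be checked numerically on a toy problem. The following is a minimal sketch, assuming a single discrete latent variable z with three states and a fixed observation x; the function name `lower_bound_and_kl` and the numbers in `joint` are hypothetical and serve only as an illustration.

```python
import numpy as np

def lower_bound_and_kl(joint, q):
    """Return (F, KL, log-evidence) for a discrete latent variable.

    joint[z] = p(x, z) for the fixed observation x; q[z] is any strictly
    positive distribution over the latent states.
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    evidence = joint.sum()                    # p(x) = sum_z p(x, z)
    posterior = joint / evidence              # p(z | x)
    F = np.sum(q * np.log(joint / q))         # lower bound functional
    kl = np.sum(q * np.log(q / posterior))    # KL(q || p(z | x)) >= 0
    return F, kl, np.log(evidence)

joint = np.array([0.10, 0.02, 0.28])          # hypothetical values of p(x, z)
q = np.array([0.5, 0.3, 0.2])                 # an arbitrary approximation
F, kl, log_ev = lower_bound_and_kl(joint, q)

# With q equal to the exact posterior the KL term vanishes and the bound is tight.
F_tight, kl_tight, _ = lower_bound_and_kl(joint, joint / joint.sum())
```

For any admissible q, `F + kl` equals `log_ev`; restricting q to a family that excludes the true posterior keeps `kl` strictly positive, which is the situation the variational approximation has to live with.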


THE MEAN FIELD APPROXIMATION 


This type of approximation results by constraining $q(\mathcal{X}^l, \boldsymbol{\theta})$ to be factorized, that is,

\[ q(\mathcal{X}^l, \boldsymbol{\theta}) = q_{\mathcal{X}^l}(\mathcal{X}^l) \, q_{\boldsymbol{\theta}}(\boldsymbol{\theta}). \tag{13.3} \]

This factorization can be, and usually is, extended to

\[ q(\mathcal{X}^l, \boldsymbol{\theta}) = q_{x_1^l}(x_1^l) \cdots q_{x_N^l}(x_N^l) \, q_{\boldsymbol{\theta}}(\boldsymbol{\theta}):
   \ \text{mean field approximation}. \tag{13.4} \]

Furthermore, the hidden variables can be further factorized, i.e.,
$q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) = \prod_{i=1}^{K} q_{\theta_i}(\theta_i)$. This type of
factorized function approximation has been inspired from the field of statistical physics and is known as
mean field approximation (e.g., [17,46,60]). No doubt, a number of combinations that group different
variables together can also be used. To simplify the notation, without sacrificing generality, we will
work with Eq. (13.3), which involves only two factors.

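Having adopted a factorized model as in Eq. (13.3), Eq. (13.1) becomes (Problem 13.1)

\[ \mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}, \boldsymbol{\xi}) =
   \int q_{\mathcal{X}^l}(\mathcal{X}^l) \left( \int q_{\boldsymbol{\theta}}(\boldsymbol{\theta})
   \ln p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi}) \, d\boldsymbol{\theta} \right) d\mathcal{X}^l
   - \int q_{\mathcal{X}^l}(\mathcal{X}^l) \ln q_{\mathcal{X}^l}(\mathcal{X}^l) \, d\mathcal{X}^l
   - \int q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \ln q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \, d\boldsymbol{\theta}, \tag{13.5} \]

which can alternatively be written as

\[ \mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}, \boldsymbol{\xi}) =
   \mathbb{E}_{q_{\mathcal{X}^l}} \Big[ \mathbb{E}_{q_{\boldsymbol{\theta}}}
   \big[ \ln p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi}) \big] \Big]
   + H_{q_{\mathcal{X}^l}} + H_{q_{\boldsymbol{\theta}}}. \tag{13.6} \]

Taking into account that the order of integration in (13.5) can be interchanged, we can also write

\[ \mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}, \boldsymbol{\xi}) =
   \mathbb{E}_{q_{\boldsymbol{\theta}}} \Big[ \mathbb{E}_{q_{\mathcal{X}^l}}
   \big[ \ln p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi}) \big] \Big]
   + H_{q_{\mathcal{X}^l}} + H_{q_{\boldsymbol{\theta}}}, \tag{13.7} \]

where the last two terms on the right-hand side are the entropies associated with the two
distribution terms, $q_{\mathcal{X}^l}$ and $q_{\boldsymbol{\theta}}$. Note that Eqs. (13.6) and (13.7) are direct generalizations of Eq. (12.65) in
Chapter 12.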

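The lower bound functional $\mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}, \boldsymbol{\xi})$ depends on three terms. Following the alternating
optimization rationale, in every iteration step we will freeze two of them and maximization will be
performed with respect to the remaining one, in an alternating fashion. Optimizing the lower bound
functional with respect to a distribution will take place by setting to zero a corresponding and
appropriately defined KL divergence. To this end, for example, let us rewrite Eq. (13.6) in a slightly different
and more convenient form. Define the quantity $\tilde{p}$ as

\[ \mathbb{E}_{q_{\boldsymbol{\theta}}} \big[ \ln p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi}) \big]
   := \ln \tilde{p}(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi}). \tag{13.8} \]

Then Eq. (13.6) (or, equivalently, (13.5)) is written as

\[ \mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}, \boldsymbol{\xi}) =
   \int q_{\mathcal{X}^l}(\mathcal{X}^l) \ln \frac{\tilde{p}(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi})}{q_{\mathcal{X}^l}(\mathcal{X}^l)}
   \, d\mathcal{X}^l + H_{q_{\boldsymbol{\theta}}}. \tag{13.9} \]

Observe that the first term on the right-hand side has the form of a KL divergence, where the
hidden variables, $\boldsymbol{\theta}$, have been averaged out. Note, however, that by its definition, $\tilde{p}$ is not necessarily a
distribution. As we will soon see, in order to become a distribution, a normalizing constant is required.

We have by now all the ingredients to go ahead with the maximization of
$\mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}, \boldsymbol{\xi})$. The
optimization will take place by maximizing first with respect to $q_{\mathcal{X}^l}$, in the sequel with respect to $q_{\boldsymbol{\theta}}$, and
finally with respect to $\boldsymbol{\xi}$. The algorithm is initialized to some arbitrary values for $\boldsymbol{\xi}^{(0)}$ as well as for
$q_{\boldsymbol{\theta}}^{(0)}$. The latter is achieved by initializing parameters (statistics) related to $q_{\boldsymbol{\theta}}^{(0)}$ (this will become more
clear while dealing with the examples). The (j + 1)th iteration comprises the following steps:

E-Step 1a: Holding $\boldsymbol{\xi}^{(j)}$ and $q_{\boldsymbol{\theta}}^{(j)}$ fixed, optimize Eq. (13.9) with respect to $q_{\mathcal{X}^l}$, that is,

\[ q_{\mathcal{X}^l}^{(j+1)}(\mathcal{X}^l) = \arg\max_{q_{\mathcal{X}^l}}
   \mathcal{F}(q_{\mathcal{X}^l}, q_{\boldsymbol{\theta}}^{(j)}, \boldsymbol{\xi}^{(j)})
   = \arg\max_{q_{\mathcal{X}^l}} \int q_{\mathcal{X}^l}(\mathcal{X}^l)
   \ln \frac{\tilde{p}(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi}^{(j)})}{q_{\mathcal{X}^l}(\mathcal{X}^l)}
   \, d\mathcal{X}^l + \text{constant}, \tag{13.10} \]

where “constant” contains all the terms that do not depend on $\mathcal{X}^l$. The negative KL divergence in
Eq. (13.10) is maximized if we set

\[ q_{\mathcal{X}^l}^{(j+1)}(\mathcal{X}^l) \propto \tilde{p}(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\xi}^{(j)}), \tag{13.11} \]

where $\propto$ denotes proportionality. Combining Eqs. (13.11) and (13.8), we can now write

\[ q_{\mathcal{X}^l}^{(j+1)}(\mathcal{X}^l) =
   \frac{\exp\Big( \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \big[ \ln p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \big] \Big)}
        {\int \exp\Big( \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \big[ \ln p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \big] \Big) d\mathcal{X}^l}, \tag{13.12} \]

where the proportionality constant has, necessarily, been replaced by the normalizing factor to
guarantee that $q_{\mathcal{X}^l}$ is a distribution. For the specific form in (13.12), the Bayes theorem, i.e.,
$p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) = p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \, p(\boldsymbol{\theta}; \boldsymbol{\xi}^{(j)})$,
has been employed in both the numerator and the denominator.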


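E-Step 1b: In this step, $\boldsymbol{\xi}^{(j)}$ and $q_{\mathcal{X}^l}^{(j+1)}$ are frozen. Following similar steps as before (repeat the steps as
an exercise), starting from the formulation in Eq. (13.7) and maximizing with respect to $q_{\boldsymbol{\theta}}$, we obtain

\[ q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) =
   \frac{p(\boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \exp\Big( \mathbb{E}_{q_{\mathcal{X}^l}^{(j+1)}} \big[ \ln p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \big] \Big)}
        {\int p(\boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \exp\Big( \mathbb{E}_{q_{\mathcal{X}^l}^{(j+1)}} \big[ \ln p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}; \boldsymbol{\xi}^{(j)}) \big] \Big) d\boldsymbol{\theta}}. \tag{13.13} \]

Steps 1a and 1b comprise the E-step of the variational Bayesian EM.

M-Step 2: Freezing $q_{\boldsymbol{\theta}}^{(j+1)}$ and $q_{\mathcal{X}^l}^{(j+1)}$, maximize the lower bound with respect to $\boldsymbol{\xi}$, that is,

\[ \boldsymbol{\xi}^{(j+1)} = \arg\max_{\boldsymbol{\xi}}
   \mathcal{F}(q_{\mathcal{X}^l}^{(j+1)}, q_{\boldsymbol{\theta}}^{(j+1)}, \boldsymbol{\xi}). \tag{13.14} \]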


The counterpart of the EM illustration of Fig. 12.10 is given in Fig. 13.1. There are two observations
to be made. Step 1 is now split into two parts, and more importantly, the KL divergence does not (in
general) go to zero; hence the bound does not become tight. This comprises the E-step of the variational
Bayesian EM.

FIGURE 13.1

Illustration of the stepwise increase of $\ln p^{(j)}$ at the (j + 1)th iteration of the variational Bayesian EM algorithm.
Observe that $\ln p^{(j+1)} > \ln p^{(j)}$, where we have used the notation
$p^{(j)} := p(\mathcal{X}; \boldsymbol{\xi}^{(j)})$ and $\tilde{p}^{(j)} := p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X}; \boldsymbol{\xi}^{(j)})$.


If there are more than two factors in $q(\mathcal{X}^l, \boldsymbol{\theta})$, as in Eq. (13.4), then there are more than two substeps
in step 1, and each time we estimate one of the factors by averaging
$\ln p(\mathcal{X}, \mathcal{X}^l, \boldsymbol{\theta}; \boldsymbol{\xi})$ with respect to
the rest. Let q be factorized in M factors,

\[ q(\mathcal{X}^l) = \prod_{m=1}^{M} q_m(\mathcal{X}_m^l), \]

where for notational uniformity we have not differentiated between parameters and latent variables and
the dependence on $\boldsymbol{\xi}$ has been suppressed. Then the general form of update becomes

\[ \ln q_m(\mathcal{X}_m^l) = \mathbb{E}\big[ \ln p(\mathcal{X}, \mathcal{X}_1^l, \ldots, \mathcal{X}_M^l) \big] + \text{constant}, \tag{13.15} \]

where the expectation is with respect to $\prod_{r \neq m} q_r(\mathcal{X}_r^l)$.
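The update rule of Eq. (13.15) can be tried out on a small discrete problem, where all expectations reduce to matrix-vector products. The sketch below is only an illustration under stated assumptions: two discrete hidden variables a and b, a hypothetical table `joint[a, b]` holding p(x, a, b) for the fixed observation x, and a hypothetical function name `mean_field_two_factors`.

```python
import numpy as np

def mean_field_two_factors(log_joint, n_iter=50):
    """Coordinate ascent for q(a, b) = q1(a) q2(b) on a discrete table.

    log_joint[a, b] = ln p(x, a, b) for the fixed observation x.
    Each update is ln q_m = E_{other factor}[ln p] + constant, cf. Eq. (13.15).
    """
    A, B = log_joint.shape
    q1 = np.full(A, 1.0 / A)
    q2 = np.full(B, 1.0 / B)
    bounds = []
    for _ in range(n_iter):
        s1 = log_joint @ q2                 # E_{q2}[ln p] as a function of a
        q1 = np.exp(s1 - s1.max())
        q1 /= q1.sum()                      # normalization absorbs the constant
        s2 = q1 @ log_joint                 # E_{q1}[ln p] as a function of b
        q2 = np.exp(s2 - s2.max())
        q2 /= q2.sum()
        q = np.outer(q1, q2)
        bounds.append(np.sum(q * (log_joint - np.log(q))))   # lower bound F
    return q1, q2, np.array(bounds)

rng = np.random.default_rng(0)
joint = rng.random((3, 4)) * 0.05           # hypothetical values of p(x, a, b)
q1, q2, bounds = mean_field_two_factors(np.log(joint))
log_evidence = np.log(joint.sum())          # ln p(x)
```

The recorded values in `bounds` never decrease from sweep to sweep and stay below `log_evidence`; the remaining gap is the KL divergence between the factorized q and the exact posterior.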

Remarks 13.1. 

• Note that $q(\mathcal{X}^l, \boldsymbol{\theta})$ is an estimate of the posterior $p(\mathcal{X}^l, \boldsymbol{\theta} | \mathcal{X})$ and each one of the factors is the
respective posterior estimate given the observations $\mathcal{X}$, for example, $q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \simeq p(\boldsymbol{\theta} | \mathcal{X})$.

• Once $q(\mathcal{X}^l, \boldsymbol{\theta})$ is factorized, no additional assumptions on the functional form of $q_{\mathcal{X}^l}$ and $q_{\boldsymbol{\theta}}$ are
made.

• Note that factorization of a PDF implies independence. Thus, if this is not the case for the data
at hand, the recovered approximations may not be faithful representations of the underlying data
structure. Hence, choosing a specific factorization has to be carried out with care. In practice, one
may have to use a number of alternatives and keep the best one. However, computational complexity
is the other face of the coin, which one must consider in a tradeoff game. In general, the factorized
variational approach tends to provide approximations to the posterior PDF that are more compact
than the true ones (e.g., [46]).

• Recall our discussion in Section 12.3 related to model selection and Occam's rule. This was a
kick-off point for our efforts to maximize the evidence with respect to different models, in order
to achieve the complexity-accuracy tradeoff. However, having resorted to approximate solutions
(even if we forget convergence to local maxima, which somehow can be bypassed by using
different initializations) we do not maximize the evidence but a lower bound of it; the latter is not,
in general, tight. How tight it is depends on the KL divergence, which, unfortunately, cannot be
trivially computed. Hence, if the lower bound is used for model selection, it has to be treated with
caution [7].

• The variational approximation to Bayesian inference was first proposed in [25] and later on used in
a number of areas, ranging from machine learning to decoding (e.g., [7,26,31-33,35]).

• Online versions: An online version of the variational Bayes algorithm was first proposed in [71].
There, the exponential family has been employed to show that parameter updating via the variational
Bayes philosophy is equivalent to a natural gradient descent method (see Section 8.12 for the natural
gradient) with step size equal to one. This equivalence is further discussed in [28], where, similarly,
a stochastic approximation algorithm is proposed in order to process chunks of data in parallel. An
online variational Bayes algorithm for parameter estimation in the context of sparse linear regression
modeling has also been proposed in [83].


13.2.1 THE CASE OF THE EXPONENTIAL FAMILY OF PROBABILITY DISTRIBUTIONS 

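Looking carefully at Eqs. (13.12) and (13.13), it becomes clear that the practical application of
the variational Bayesian EM depends on the computational tractability of the expected values of
$\ln p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}; \boldsymbol{\xi})$. Let us now see the form that the iterative steps take when one adopts the PDF models
from the exponential family.

Let us assume that the samples $(x_n, x_n^l)$, n = 1, 2, ..., N, in the complete data set are i.i.d. Then

\[ p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}) = \prod_{n=1}^{N} p(x_n, x_n^l | \boldsymbol{\theta}). \tag{13.16} \]

We further assume $p(x_n, x_n^l | \boldsymbol{\theta})$ to lie within the exponential family (Section 12.8), that is,

\[ p(x_n, x_n^l | \boldsymbol{\theta}) = g(\boldsymbol{\theta}) f(x_n, x_n^l)
   \exp\big( \boldsymbol{\phi}^T(\boldsymbol{\theta}) \, u(x_n, x_n^l) \big). \tag{13.17} \]

We also adopt a prior for $\boldsymbol{\theta}$ to be of the respective conjugate form, that is,

\[ p(\boldsymbol{\theta} | \lambda, \boldsymbol{\nu}) = h(\lambda, \boldsymbol{\nu}) \big( g(\boldsymbol{\theta}) \big)^{\lambda}
   \exp\big( \boldsymbol{\phi}^T(\boldsymbol{\theta}) \boldsymbol{\nu} \big). \tag{13.18} \]

The parameters $\lambda$, $\boldsymbol{\nu}$ constitute $\boldsymbol{\xi}$, which will be considered fixed, because our current emphasis is to
follow up the specific functional forms that $q_{\mathcal{X}^l}$ and $q_{\boldsymbol{\theta}}$ get as iterations progress. So we relax the
notational dependence on these parameters.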

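E-Step 1a: We have from Eq. (13.12)

\[ q_{\mathcal{X}^l}^{(j+1)}(\mathcal{X}^l) \propto
   \exp\Big( \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \big[ \ln p(\mathcal{X}, \mathcal{X}^l | \boldsymbol{\theta}) \big] \Big)
   = \exp\Big( \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \Big[ \sum_{n=1}^{N} \ln p(x_n, x_n^l | \boldsymbol{\theta}) \Big] \Big)
   = \prod_{n=1}^{N} \exp\Big( \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \big[ \ln p(x_n, x_n^l | \boldsymbol{\theta}) \big] \Big), \]

which then suggests that

\[ q_{x_n^l}^{(j+1)}(x_n^l) \propto
   \exp\Big( \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \big[ \ln p(x_n, x_n^l | \boldsymbol{\theta}) \big] \Big), \]

and combined with Eq. (13.17) results in

\[ q_{x_n^l}^{(j+1)}(x_n^l) = \tilde{g} f(x_n, x_n^l)
   \exp\big( \tilde{\boldsymbol{\phi}}^T u(x_n, x_n^l) \big), \tag{13.19} \]

where $\tilde{g}$ is the respective normalization constant and

\[ \tilde{\boldsymbol{\phi}} := \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j)}} \big[ \boldsymbol{\phi}(\boldsymbol{\theta}) \big]. \]

This is very interesting indeed. Although no functional form was assumed for $q_{\mathcal{X}^l}$, it turns out to be a
member of the exponential family!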

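E-Step 1b: In a similar way, from Eqs. (13.13), (13.16), and (13.17), we obtain

\[ q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) \propto p(\boldsymbol{\theta})
   \exp\Big( N \ln g(\boldsymbol{\theta})
   + \sum_{n=1}^{N} \mathbb{E}_{q_{x_n^l}^{(j+1)}} \big[ \ln f(x_n, x_n^l) \big]
   + \boldsymbol{\phi}^T(\boldsymbol{\theta}) \sum_{n=1}^{N} \mathbb{E}_{q_{x_n^l}^{(j+1)}} \big[ u(x_n, x_n^l) \big] \Big), \]

which combined with Eq. (13.18) results in

\[ q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) \propto \big( g(\boldsymbol{\theta}) \big)^{\lambda + N}
   \exp\Big( \boldsymbol{\phi}^T(\boldsymbol{\theta}) \Big( \boldsymbol{\nu}
   + \sum_{n=1}^{N} \mathbb{E}_{q_{x_n^l}^{(j+1)}} \big[ u(x_n, x_n^l) \big] \Big) \Big). \tag{13.20} \]

Thus, the approximation $q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta})$ of the posterior $p(\boldsymbol{\theta} | \mathcal{X})$ is of the same form as the conjugate prior
with

\[ \tilde{\lambda} = \lambda + N, \quad \tilde{\boldsymbol{\nu}} = \boldsymbol{\nu}
   + \sum_{n=1}^{N} \mathbb{E}_{q_{x_n^l}^{(j+1)}} \big[ u(x_n, x_n^l) \big]. \tag{13.21} \]

Note that Eq. (13.21) is of the same form as (12.77). We only have to average out the hidden variables.
This is a very elegant result, because nothing has been assumed about the functional form of $q_{\boldsymbol{\theta}}$. In
other words, once we adopt the functional form for the PDFs of the complete set as well as that of
the prior of the parameters to be of the exponential type, then subsequent iterations become a “family
business.”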
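To see Eqs. (13.19) and (13.21) at work, consider a toy model that is not taken from the text: a binary latent label z_n with P(z_n = 1) = θ, and an observation x_n drawn from one of two known unit-variance Gaussians selected by z_n. The complete-data likelihood then has the form of Eq. (13.17) with g(θ) = 1 - θ, φ(θ) = ln(θ / (1 - θ)), and u = z_n, and the conjugate prior of Eq. (13.18) is a Beta(ν + 1, λ - ν + 1) density. The sketch below rests on these assumptions; the function names and the finite-difference digamma are illustrative choices.

```python
import math
import numpy as np

def digamma(a, h=1e-5):
    # psi(a) = d ln Gamma(a) / da, approximated here by a central difference
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)

def vb_mixing_weight(x, lam=0.0, nu=0.0, n_iter=50):
    """Variational updates for theta = P(z_n = 1) with known components."""
    # ln f(x_n, z_n) up to a common constant: two known Gaussians (assumed)
    log_f1 = -0.5 * (x - 2.0) ** 2          # z_n = 1
    log_f0 = -0.5 * (x + 2.0) ** 2          # z_n = 0
    lam_t, nu_t = lam, nu                   # start from the prior
    for _ in range(n_iter):
        # E-step 1a: phi_tilde = E_q[phi(theta)] under Beta(nu_t + 1, lam_t - nu_t + 1)
        phi_t = digamma(nu_t + 1.0) - digamma(lam_t - nu_t + 1.0)
        # q(z_n = 1) is proportional to f(x_n, 1) exp(phi_tilde), cf. Eq. (13.19)
        r = 1.0 / (1.0 + np.exp(log_f0 - log_f1 - phi_t))
        # E-step 1b, Eq. (13.21): lambda~ = lambda + N, nu~ = nu + sum_n E[u]
        lam_t = lam + x.size
        nu_t = nu + r.sum()
    return lam_t, nu_t, r

rng = np.random.default_rng(1)
z = rng.random(500) < 0.7                   # true mixing weight 0.7 (assumed)
x = np.where(z, 2.0, -2.0) + rng.standard_normal(500)
lam_t, nu_t, r = vb_mixing_weight(x)
post_mean = (nu_t + 1.0) / (lam_t + 2.0)    # mean of the Beta posterior factor
```

Both factors stay inside their families throughout: q(z_n) remains Bernoulli and the factor for θ remains a Beta density whose parameters are the prior values plus averaged sufficient statistics.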










13.3 A VARIATIONAL BAYESIAN APPROACH TO LINEAR REGRESSION 

Once more, let us consider our familiar regression task

\[ \mathbf{y} = \Phi \boldsymbol{\theta} + \boldsymbol{\eta}, \quad \mathbf{y} \in \mathbb{R}^N, \ \boldsymbol{\theta} \in \mathbb{R}^K. \]

In Section 12.5, we treated the case where $\boldsymbol{\eta}$ was Gaussian and the prior $p(\boldsymbol{\theta})$ was also Gaussian. We
used the EM in order to optimize the evidence $p(\mathbf{y})$ with respect to the parameters that define the two
adopted Gaussian PDFs; note that for this case, one could bypass the EM and resort to analytical
computations in order to obtain the evidence and subsequently use an optimization technique to estimate
the unknown parameters.

In this section, we will adopt assumptions that do not allow for tractable analytic computations of
the posterior, $p(\boldsymbol{\theta} | \mathbf{y})$, which is a prerequisite both for the standard EM and for the analytic
computations of the evidence $p(\mathbf{y})$. This approach is far from a pedagogic toy and has strong practical flavor.
We will develop the task in some detail, and the reader is advised to go through the computations,
because they are typical of what will be encountered in practice, once the variational Bayesian approach
is chosen for addressing a task.

Assume that

\[ p(\mathbf{y} | \boldsymbol{\theta}, \beta) = \mathcal{N}(\mathbf{y} | \Phi \boldsymbol{\theta}, \beta^{-1} I). \tag{13.22} \]

That is, the noise is Gaussian and for simplicity we have considered it to be white,
$\Sigma_\eta = \sigma_\eta^2 I$, and $\beta = 1/\sigma_\eta^2$.
In contrast to what we did in Section 12.5, now we will be more democratic and give the freedom to
each one of the parameter components, $\theta_k$, to have a different variance,
$\sigma_k^2 := 1/\alpha_k$, k = 0, 1, ..., K - 1.
Moreover, we go one step further. The values of $\beta$ and $\alpha_k$, k = 0, ..., K - 1, will not be treated as
deterministic variables. We will also treat them as random ones, which will be assigned prior
PDFs; these prior PDFs are in turn controlled by another set of hyperparameters. More specifically, our
model, in addition to Eq. (13.22), comprises [8]

\[ p(\boldsymbol{\theta} | \boldsymbol{\alpha}) = \prod_{k=0}^{K-1} \mathcal{N}(\theta_k | 0, \alpha_k^{-1}), \tag{13.23} \]

\[ p(\boldsymbol{\alpha}) = \prod_{k=0}^{K-1} \mathrm{Gamma}(\alpha_k | a, b), \tag{13.24} \]

and

\[ p(\beta) = \mathrm{Gamma}(\beta | c, d). \tag{13.25} \]

Note that the previous choice of the priors indicates our will to “play” the game within the exponential
family terrain. The prior $p(\boldsymbol{\alpha})$ is the conjugate pair of Eq. (13.23) (see Chapter 12). Also, Eq. (13.25)
would be the conjugate of Eq. (13.22), if we had considered $\boldsymbol{\theta}$ fixed. Fig. 13.2 provides a graphical
representation of the dependencies among the various variables involved in our model. Arrows indicate
conditional dependencies. Graphical models will be considered in a formal way in Chapter 15. Note
that such a model forms various levels of hierarchy in the dependency among the involved parameters.

FIGURE 13.2

A graphical illustration of the dependencies among the various variables involved in the model of linear regression.
The red circle indicates the random variable that is observed, gray circles indicate hidden random variables, and
squares correspond to deterministic parameters. The direction of each arrow indicates the direction of the
dependence between the connected variables. The red box indicates that the above dependencies hold for all N time
instants.


This concept of hierarchy is at the heart of what we call hierarchical Bayesian modeling. Each one
of the involved PDFs is expressed in terms of certain parameters. Because the values of these
parameters are unknown, they are also treated as random variables whose priors are expressed in terms of a
new set of hyperparameters. Each one of them is in turn treated as a random variable associated with
a new prior, known as hyperprior. This rationale can be extended in order to construct different levels
of hierarchy. Often, at the higher level of hierarchy, the corresponding (unknown) hyperparameters
are assigned values by the user, based on experience; for example, the overall model can be relatively
insensitive to their specific values, which makes the corresponding choice a fairly easy job.

Our current task comprises hidden variables in the form of parameters grouped in
$\boldsymbol{\theta}$, $\boldsymbol{\alpha}$, and $\beta$ and it
involves no other unobserved variables. The set of observations is now given by $\mathbf{y}$. Also, observe that
the posterior $p(\boldsymbol{\theta}, \boldsymbol{\alpha}, \beta | \mathbf{y})$ is not analytically tractable. We will resort to the variational Bayesian EM
to obtain an estimate of the previous posterior PDF.

Using the mean field approximation, we assume that the approximation to the posterior (the
dependence on $\mathbf{y}$ has been suppressed for notational convenience) factorizes as

\[ q(\boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) = q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \, q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha}) \, q_{\beta}(\beta), \tag{13.26} \]

where we have relaxed our notation, for simplicity, from the explicit dependence on a, b, c, and d. We
will bring them back into the game whenever needed. The variational EM consists of three substeps,
one for each factor in Eq. (13.26). Starting from some initial guesses for $\mathbb{E}[\beta]$, $\mathbb{E}[\alpha_k]$, k = 0, ...,
K - 1 (it will become clear soon why we need to start with those; see footnote 1), we get:

Footnote 1: If a, b, c, d were not fixed, then one would need initialization for these parameters too.

E-Step 1a: From the general update form of Eq. (13.15), we have

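\[ \ln q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) =
   \mathbb{E}_{q_{\boldsymbol{\alpha}}^{(j)} q_{\beta}^{(j)}} \big[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \big] + \text{constant}, \tag{13.27} \]

where now

\[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta)
   = \ln \big( p(\mathbf{y} | \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \, p(\boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \big)
   = \ln \big( p(\mathbf{y} | \boldsymbol{\theta}, \beta) \, p(\boldsymbol{\theta} | \boldsymbol{\alpha}) \, p(\boldsymbol{\alpha}) \, p(\beta) \big), \tag{13.28} \]

where the independence of $\mathbf{y}$ on $\boldsymbol{\alpha}$, given the values $\boldsymbol{\theta}$, has been taken into account. Using Eqs. (13.22)
and (13.25) and some trivial algebra we get

\[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) =
   \ln \frac{\beta^{N/2}}{(2\pi)^{N/2}} - \frac{\beta}{2} \| \mathbf{y} - \Phi \boldsymbol{\theta} \|^2
   - \frac{1}{2} \sum_{k=0}^{K-1} \alpha_k \theta_k^2
   + \sum_{k=0}^{K-1} \ln \frac{\alpha_k^{1/2}}{(2\pi)^{1/2}}
   + \ln p(\boldsymbol{\alpha}) + \ln p(\beta), \]

or

\[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) =
   - \frac{\beta}{2} \| \mathbf{y} - \Phi \boldsymbol{\theta} \|^2
   - \frac{1}{2} \sum_{k=0}^{K-1} \alpha_k \theta_k^2 + \text{constant}, \tag{13.29} \]

where “constant” includes all terms that do not depend on $\boldsymbol{\theta}$, because in this step our goal is to estimate
a function of $\boldsymbol{\theta}$. Expanding Eq. (13.29) and taking expectations with respect to $\beta$ and $\boldsymbol{\alpha}$, considering
$q_{\beta}^{(j)}(\beta)$ and $q_{\boldsymbol{\alpha}}^{(j)}(\boldsymbol{\alpha})$ known, we get

\[ \ln q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) =
   \mathbb{E}_{q_{\boldsymbol{\alpha}}^{(j)} q_{\beta}^{(j)}} \big[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \big] + \text{constant}
   = - \frac{1}{2} \mathbb{E}[\beta] \boldsymbol{\theta}^T \Phi^T \Phi \boldsymbol{\theta}
     - \frac{1}{2} \mathbb{E}[\beta] \mathbf{y}^T \mathbf{y}
     + \mathbb{E}[\beta] \boldsymbol{\theta}^T \Phi^T \mathbf{y}
     - \frac{1}{2} \boldsymbol{\theta}^T A \boldsymbol{\theta} + \text{constant}, \tag{13.30} \]

where by definition

\[ A := \mathrm{diag}\{ \mathbb{E}[\alpha_0], \ldots, \mathbb{E}[\alpha_{K-1}] \}, \]

and we have used for notational simplifications

\[ \mathbb{E}[\beta] := \mathbb{E}_{q_{\beta}^{(j)}}[\beta] \quad \text{and} \quad
   \mathbb{E}[\alpha_k] := \mathbb{E}_{q_{\boldsymbol{\alpha}}^{(j)}}[\alpha_k], \quad k = 0, 1, 2, \ldots, K - 1. \tag{13.31} \]

It is readily noticed that the right-hand side of Eq. (13.30) is of a quadratic form with respect to $\boldsymbol{\theta}$,
and hence $q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta})$ is Gaussian; in order to completely specify it, it suffices to compute the respective
mean and covariance (precision) matrix.

Reshuffling the terms in Eq. (13.30), we get

\[ \ln q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) =
   - \frac{1}{2} \boldsymbol{\theta}^T \big( A + \mathbb{E}[\beta] \Phi^T \Phi \big) \boldsymbol{\theta}
   + \mathbb{E}[\beta] \boldsymbol{\theta}^T \Phi^T \mathbf{y} + \text{constant}, \]

which according to Eqs. (12.114), (12.116), and (12.117) of the appendix of the previous chapter
(available via the book’s site) results in

\[ q_{\boldsymbol{\theta}}^{(j+1)}(\boldsymbol{\theta}) =
   \mathcal{N}\big( \boldsymbol{\theta} \,|\, \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)}, \tilde{\Sigma}_{\theta}^{(j+1)} \big), \tag{13.32} \]

\[ \tilde{\Sigma}_{\theta}^{(j+1)} = \big( A + \mathbb{E}[\beta] \Phi^T \Phi \big)^{-1}, \tag{13.33} \]

\[ \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)} = \mathbb{E}[\beta] \, \tilde{\Sigma}_{\theta}^{(j+1)} \Phi^T \mathbf{y}. \tag{13.34} \]

During the first iteration step, $\mathbb{E}[\beta]$ and $\mathbb{E}[\alpha_k]$ are provided by their initial values. For the subsequent
iterations, they have to be obtained together with $q_{\beta}^{(j)}$ and $q_{\boldsymbol{\alpha}}^{(j)}$. Note that the approximation to the
posterior $p(\boldsymbol{\theta} | \mathbf{y})$ (Eq. (13.32)) turns out to be Gaussian, although we did not assume it to be so. This
is a consequence of the particular form of the adopted PDFs, which spring forth from the exponential
family.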

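E-Step 1b: We have

\[ \ln q_{\boldsymbol{\alpha}}^{(j+1)}(\boldsymbol{\alpha}) =
   \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)} q_{\beta}^{(j)}} \big[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \big] + \text{constant} \tag{13.35} \]

\[ \phantom{\ln q_{\boldsymbol{\alpha}}^{(j+1)}(\boldsymbol{\alpha})} =
   \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)} q_{\beta}^{(j)}} \big[ \ln p(\boldsymbol{\theta} | \boldsymbol{\alpha}) + \ln p(\boldsymbol{\alpha}) \big] + \text{constant}, \tag{13.36} \]

where the constant contains all terms that do not depend on $\boldsymbol{\alpha}$. Because no term in the bracket in the
right-hand side of Eq. (13.36) depends on $\beta$, we have

\[ \ln q_{\boldsymbol{\alpha}}^{(j+1)}(\boldsymbol{\alpha}) =
   \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)}} \Big[ \frac{1}{2} \sum_{k=0}^{K-1} \ln \alpha_k
   - \frac{1}{2} \sum_{k=0}^{K-1} \alpha_k \theta_k^2 \Big]
   + \ln p(\boldsymbol{\alpha}) + \text{constant}. \tag{13.37} \]

Taking into account Eq. (13.24) and after some algebra (Problem 13.2), we obtain

\[ q_{\boldsymbol{\alpha}}^{(j+1)}(\boldsymbol{\alpha}) = \prod_{k=0}^{K-1} \mathrm{Gamma}(\alpha_k | \tilde{a}, \tilde{b}_k), \tag{13.38} \]

where

\[ \tilde{a} = a + \frac{1}{2}, \tag{13.39} \]

\[ \tilde{b}_k = b + \frac{1}{2} \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)}} \big[ \theta_k^2 \big], \quad k = 0, 1, \ldots, K - 1. \tag{13.40} \]

In order to compute $\mathbb{E}[\theta_k^2]$, recall Eqs. (12.46) and (12.47) and apply them into our setting to give

\[ \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)}} \big[ \boldsymbol{\theta} \boldsymbol{\theta}^T \big] =
   \tilde{\Sigma}_{\theta}^{(j+1)} + \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)} \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)T}, \]

or

\[ \mathbb{E}\big[ \theta_k^2 \big] =
   \Big[ \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)}} \big[ \boldsymbol{\theta} \boldsymbol{\theta}^T \big] \Big]_{kk}
   = \Big[ \tilde{\Sigma}_{\theta}^{(j+1)} + \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)} \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)T} \Big]_{kk},
   \quad k = 0, 1, \ldots, K - 1, \tag{13.41} \]

where $[A]_{kk}$ denotes the (k, k) element of a matrix A. To complete the computations, we have to
compute $\mathbb{E}[\alpha_k]$, k = 0, 1, ..., K - 1, to be used during the next iteration in Eq. (13.33). However,
because each $\alpha_k$ follows a gamma distribution, we know that (Section 2.3.2)

\[ \mathbb{E}_{q_{\boldsymbol{\alpha}}^{(j+1)}}[\alpha_k] = \frac{\tilde{a}}{\tilde{b}_k}. \tag{13.42} \]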


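E-Step 1c: We have

\[ \ln q_{\beta}^{(j+1)}(\beta) =
   \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)} q_{\boldsymbol{\alpha}}^{(j+1)}} \big[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \big] + \text{constant}
   = \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)} q_{\boldsymbol{\alpha}}^{(j+1)}} \big[ \ln p(\mathbf{y} | \boldsymbol{\theta}, \beta) + \ln p(\beta) \big] + \text{constant}. \]

This is of the same form as Eq. (13.36), and following similar steps (Problem 13.3), it can be shown
that

\[ q_{\beta}^{(j+1)}(\beta) = \mathrm{Gamma}(\beta | \tilde{c}, \tilde{d}), \tag{13.43} \]

\[ \tilde{c} = c + \frac{N}{2}, \tag{13.44} \]

\[ \tilde{d} = d + \frac{1}{2} \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)}} \big[ \| \mathbf{y} - \Phi \boldsymbol{\theta} \|^2 \big]. \tag{13.45} \]

To compute the expectation in Eq. (13.45), recall Eq. (12.49), which for our needs becomes

\[ \mathbb{E}_{q_{\boldsymbol{\theta}}^{(j+1)}} \big[ \| \mathbf{y} - \Phi \boldsymbol{\theta} \|^2 \big] =
   \| \mathbf{y} - \Phi \tilde{\boldsymbol{\mu}}_{\theta}^{(j+1)} \|^2
   + \mathrm{trace}\big\{ \Phi \tilde{\Sigma}_{\theta}^{(j+1)} \Phi^T \big\}. \tag{13.46} \]

Finally, we have

\[ \mathbb{E}_{q_{\beta}^{(j+1)}}[\beta] = \frac{\tilde{c}}{\tilde{d}}, \tag{13.47} \]

which completes all the computations associated with the E-step of the variational EM. Note that
$q_{\boldsymbol{\alpha}}^{(j+1)}(\boldsymbol{\alpha}) \simeq p(\boldsymbol{\alpha} | \mathbf{y})$ and
$q_{\beta}^{(j+1)}(\beta) \simeq p(\beta | \mathbf{y})$ retain the gamma functional form of the corresponding
priors that were originally adopted, without forcing them to.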

In principle, one can add an extra M-step in the algorithm to maximize the bound with respect to
the unknown parameters a, b, c, and d. However, in practice, for computational simplicity these
parameters are fixed to very small values, that is, $a = b = c = d = 10^{-6}$, which correspond to uninformative
gamma prior distributions, in the sense of giving no preference to any specific range of values. Note
that for such small values, the gamma distribution falls as 1/x. Indeed, for $a, b \simeq 0$,

\[ \mathrm{Gamma}(x | a, b) \propto \frac{1}{x}, \quad x > 0. \]

Because every positive x can be expressed as

\[ x = \exp(z), \quad z = \ln x, \quad z \in \mathbb{R}, \]

it can be easily checked (Problem 13.4) that the PDF that describes z is uniform. This is a typical
procedure in practice; that is, one allows enough levels of hierarchy and fixes the hyperparameters in
the highest level to define uninformative hyperpriors.
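The uniformity claim follows from a one-line change of variables. The sketch below treats the improper limit a, b → 0 informally; the labels $p_x$ and $p_z$ for the densities of x and z are introduced here for illustration only.

```latex
\[
p_z(z) \;=\; p_x\big(\exp(z)\big)\,\left|\frac{dx}{dz}\right|
\;\propto\; \frac{1}{\exp(z)}\,\exp(z) \;=\; 1, \qquad z \in \mathbb{R}.
\]
```

In other words, a density falling as 1/x on the positive axis spreads its mass evenly over the logarithm of x, so no order of magnitude is favored a priori.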

In summary, the variational Bayesian EM steps are given in Algorithm 13.1. 

Algorithm 13.1 (Variational EM for linear regression).

• Initialization

- Select initial values for $\mathbb{E}[\beta]$, $\mathbb{E}_{q_\alpha}[\alpha_k]$, k = 0, 1, ..., K - 1.

• For j = 1, 2, ..., Do

- $A = \mathrm{diag}\{ \mathbb{E}_{q_\alpha}[\alpha_0], \mathbb{E}_{q_\alpha}[\alpha_1], \ldots, \mathbb{E}_{q_\alpha}[\alpha_{K-1}] \}$.

- Compute $\tilde{\Sigma}_\theta$ from Eq. (13.33) and $\tilde{\boldsymbol{\mu}}_\theta$ from Eq. (13.34).

- Compute $\tilde{a}$ from Eq. (13.39).

- Compute $\tilde{b}_k$, k = 0, 1, ..., K - 1, from Eqs. (13.40) and (13.41).

- Compute $\mathbb{E}_{q_\alpha}[\alpha_k]$, k = 0, 1, ..., K - 1, from Eq. (13.42).

- Compute $\tilde{c}$ from Eq. (13.44) and $\tilde{d}$ from Eqs. (13.45) and (13.46).

- Compute $\mathbb{E}_{q_\beta}[\beta]$ from Eq. (13.47).

- If convergence criterion is met, Stop.

• End For
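The steps of Algorithm 13.1 translate almost line by line into array code. The following is a minimal sketch rather than a reference implementation: the function name, the initial values E[α_k] = 1 and E[β] = 1, the stopping rule on the change of the posterior mean, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def variational_linear_regression(Phi, y, a=1e-6, b=1e-6, c=1e-6, d=1e-6,
                                  n_iter=200, tol=1e-8):
    """Sketch of Algorithm 13.1 for y = Phi theta + eta."""
    N, K = Phi.shape
    E_alpha = np.ones(K)                  # initial E[alpha_k] (arbitrary choice)
    E_beta = 1.0                          # initial E[beta] (arbitrary choice)
    PtP, Pty = Phi.T @ Phi, Phi.T @ y
    mu = np.zeros(K)
    for _ in range(n_iter):
        mu_old = mu
        # Eqs. (13.33)-(13.34): Gaussian factor for theta
        Sigma = np.linalg.inv(np.diag(E_alpha) + E_beta * PtP)
        mu = E_beta * (Sigma @ Pty)
        # Eqs. (13.39)-(13.42): gamma factors for alpha_k
        a_t = a + 0.5
        b_t = b + 0.5 * (np.diag(Sigma) + mu ** 2)
        E_alpha = a_t / b_t
        # Eqs. (13.44)-(13.47): gamma factor for beta
        c_t = c + 0.5 * N
        resid = y - Phi @ mu
        d_t = d + 0.5 * (resid @ resid + np.trace(Sigma @ PtP))
        E_beta = c_t / d_t
        # Assumed convergence criterion: relative change of the posterior mean
        if np.linalg.norm(mu - mu_old) < tol * (1.0 + np.linalg.norm(mu_old)):
            break
    return mu, Sigma, E_alpha, E_beta

# Synthetic sparse regression problem (hypothetical data)
rng = np.random.default_rng(0)
N, K = 200, 10
Phi = rng.standard_normal((N, K))
theta_true = np.zeros(K)
theta_true[[1, 4, 7]] = [1.5, -2.0, 1.0]
y = Phi @ theta_true + 0.1 * rng.standard_normal(N)   # noise precision 100
mu, Sigma, E_alpha, E_beta = variational_linear_regression(Phi, y)
```

On data of this kind, the expected precisions E[α_k] of the coefficients that are truly zero grow much larger than those of the active ones, which shrinks the corresponding entries of the posterior mean toward zero; this is the sparsity-promoting behavior that motivates giving each θ_k its own variance.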

Once the algorithm has converged, predictions can be made on the basis of the predictive
distribution given in Eqs. (12.21)-(12.23), by replacing $\Sigma_{\theta|y}$, $\boldsymbol{\mu}_{\theta|y}$, and the noise precision
$1/\sigma_\eta^2$ by the converged values of $\tilde{\Sigma}_\theta$, $\tilde{\boldsymbol{\mu}}_\theta$,
and $\mathbb{E}[\beta]$, respectively. Note, however, that this is only an approximation, because the Gaussian form
for the posterior of the parameters is a result of the mean field approximation and also we have used
the mean value, $\mathbb{E}[\beta]$, in place of the noise precision. The latter can be justified because as the number
of training samples increases, the distribution of $\beta$ sharply peaks around its mean value [8].

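COMPUTATION OF THE LOWER BOUND

Once the algorithm has converged, the quantities $q_{\boldsymbol{\theta}}(\boldsymbol{\theta})$, $q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha})$, $q_{\beta}(\beta)$ are available and the lower bound
$\mathcal{F}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\alpha}}, q_{\beta})$ can be computed. The computation of this lower bound can also be done at every iteration
to check how much it changes from iteration to iteration, and then this can be used as a convergence
criterion. Let $q_{\boldsymbol{\theta}}$, $q_{\boldsymbol{\alpha}}$, $q_{\beta}$ be the approximate posteriors after convergence, defined by the parameters
$\tilde{\Sigma}_\theta$, $\tilde{\boldsymbol{\mu}}_\theta$, $\tilde{a}$, $\tilde{b}_k$, k = 0, 1, 2, ..., K - 1, $\tilde{c}$, and $\tilde{d}$. The lower bound is then given as

\[ \mathcal{F}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\alpha}}, q_{\beta}) =
   \mathbb{E}_{q_{\boldsymbol{\theta}} q_{\boldsymbol{\alpha}} q_{\beta}} \big[ \ln p(\mathbf{y}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \beta) \big]
   - \mathbb{E}_{q_{\boldsymbol{\theta}}} \big[ \ln q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \big]
   - \mathbb{E}_{q_{\boldsymbol{\alpha}}} \big[ \ln q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha}) \big]
   - \mathbb{E}_{q_{\beta}} \big[ \ln q_{\beta}(\beta) \big]. \tag{13.48} \]

Performing the expectations can be a bit tedious, but it is straightforward (Problem 13.5).

13.4 A VARIATIONAL BAYESIAN APPROACH TO GAUSSIAN
MIXTURE MODELING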

Dealing with the Gaussian mixture modeling in Section 12.6, it was pointed out in the remarks that the
standard EM approach may lead to singularities. One way to bypass this drawback is to enforce priors
on the involved parameters and resort to a variational Bayesian philosophy to estimate the quantities
of interest. The task was first treated in [4] and later in [13]. We will present the latter approach and
comment on the underlying differences between the two later on.

Given a set of observations $\mathcal{X} = \{x_1, \ldots, x_N\}$, the respective PDF model is

\[ p(x) = \sum_{k=1}^{K} P_k \, \mathcal{N}(x | \boldsymbol{\mu}_k, Q_k^{-1}), \quad x \in \mathbb{R}^l. \]

The task is to estimate the unknown parameters $(P_k, \boldsymbol{\mu}_k, Q_k)$, k = 1, 2, ..., K. We already know that
this is a typical task with latent variables, and the complete set comprises $(x_n, k_n)$, n = 1, 2, ..., N,
with $k_n$ being the index of the respective mixture, $k_n$ = 1, 2, ..., K. In Section 12.6, the information
about each of the latent variables, $k_n$, entered into the problem via the posterior $P(k_n | x_n)$ for every
time instant n, the summation over all possible values of $k_n$ was performed, and hence one could drop
out the time index. However, in the current context, a different path has to be followed and we have to
consider the latent variables together with their corresponding time index. To this end, and following
[13], an auxiliary latent random vector is introduced, $z_n \in \mathbb{R}^K$, for each observation, n = 1, 2, ..., N.
Its components take binary values, such as

\[ z_{nk} \in \{0, 1\} \quad \text{and} \quad \sum_{k=1}^{K} z_{nk} = 1, \tag{13.49} \]

and they are used as indicators of the respective mixture from which the observation at time n, $x_n$, was
drawn; that is, if $z_{nk} = 1$ it indicates that $x_n$ was drawn from the kth distribution. Obviously,

\[ P(z_{nk} = 1) = P_k, \]

and for any $z_n \in \mathbb{R}^K$ that satisfies Eq. (13.49),

\[ P(z_n) = \prod_{k=1}^{K} P_k^{z_{nk}}. \tag{13.50} \]

Hence, the probability of occurrence of the set $\mathcal{Z} = \{z_1, \ldots, z_N\}$ is

\[ P(\mathcal{Z}) = \prod_{n=1}^{N} \prod_{k=1}^{K} P_k^{z_{nk}}, \tag{13.51} \]

and in this way, we have described the random nature of the N latent variables using a multinomial
probability distribution.
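The indicator encoding of Eqs. (13.49)-(13.51) is easy to reproduce in array form. The sketch below is an illustration only, with hypothetical mixing probabilities and variable names: it draws mixture labels, turns them into one-hot vectors z_n, and evaluates the logarithm of Eq. (13.51).

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])            # hypothetical mixing probabilities P_k
N, K = 6, P.size

labels = rng.choice(K, size=N, p=P)      # k_n: mixture index of each observation
Z = np.eye(K, dtype=int)[labels]         # z_n as one-hot rows, Eq. (13.49)

# ln P(Z) = sum_n sum_k z_nk ln P_k, the logarithm of Eq. (13.51)
log_P_Z = np.sum(Z * np.log(P))
```

Because each row of `Z` contains a single one, the double sum simply picks out ln P_k for the component that generated each observation.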
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In the sequel, we will treat the mean values as well as the precision matrices as random quantities 
adopting the following prior PDFs: 


p(n k )=Ar(n k \o,r 1 i) 


and 


p(Qk) = W(Q k \Wo,v 0 ), 

for fixed vq, Wo, and ( J >. That is, the adopted priors are Gaussians for the mean values and Wishart 
PDFs for the precision matrices. We will treat P — [I 1 ],..., Pk] T as deterministic parameters whose 
optimized values are obtained in the M-step. 

Following the philosophy of the variational Bayesian EM, we adopt 

q(Z, n 1:K , Qi :K ) = qz(Z)q [l (fi l:K )qQ(Qi-.K), 

where fi i:K and Q\-k indicate the collections {/t 1; ..., ft K } and {Q i,..., Qk}, respectively. 
Furthermore, observe that the conditional PDF of the observations can now be written as 

N K 

p(X\Z,n l:K , Q l:K ) = nn GaT 1 ))^' . 

n— 1 k= 1 

Fig. 13.3 shows the corresponding graphical model. 

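Computational steps of the variational EM for Gaussian mixture modeling

Initialization: (a) $\boldsymbol{P}^{(0)}$, (b) $\mathbb{E}^{(0)}[Q_k]$, (c) $\mathbb{E}^{(0)}[\ln|Q_k|]$, (d) $\mathbb{E}^{(0)}[\boldsymbol{\mu}_k] := \tilde{\boldsymbol{\mu}}_k^{(0)}$, and (e) $\mathbb{E}^{(0)}[\boldsymbol{\mu}_k \boldsymbol{\mu}_k^T] := \tilde{\Sigma}_k^{(0)} + \tilde{\boldsymbol{\mu}}_k^{(0)} \tilde{\boldsymbol{\mu}}_k^{(0)T}$, $k = 1, 2, \ldots, K$, where $|\cdot|$ denotes the corresponding determinant.

The $(j+1)$th iteration consists of the following computations (Problem 13.6):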



FIGURE 13.3 


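The graphical model associated with the Gaussian mixture modeling of Section 13.4.

E-Step 1a:

$$\pi_{nk} = P_k^{(j)} \exp\left( \frac{1}{2} \mathbb{E}_{q_Q^{(j)}}[\ln|Q_k|] - \frac{1}{2} \mathrm{trace}\left\{ \mathbb{E}_{q_Q^{(j)}}[Q_k] \left( \boldsymbol{x}_n \boldsymbol{x}_n^T - \boldsymbol{x}_n \mathbb{E}_{q_\mu^{(j)}}[\boldsymbol{\mu}_k^T] - \mathbb{E}_{q_\mu^{(j)}}[\boldsymbol{\mu}_k] \boldsymbol{x}_n^T + \mathbb{E}_{q_\mu^{(j)}}[\boldsymbol{\mu}_k \boldsymbol{\mu}_k^T] \right) \right\} \right),$$

$$\rho_{nk} = \frac{\pi_{nk}}{\sum_{k=1}^{K} \pi_{nk}},$$

$$q_Z^{(j+1)}(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}.$$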

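E-Step 1b:

$$\tilde{Q}_k = \beta I + \mathbb{E}_{q_Q^{(j)}}[Q_k] \sum_{n=1}^{N} \rho_{nk},$$

$$\tilde{\boldsymbol{\mu}}_k = \tilde{Q}_k^{-1} \mathbb{E}_{q_Q^{(j)}}[Q_k] \sum_{n=1}^{N} \rho_{nk} \boldsymbol{x}_n,$$

$$q_\mu^{(j+1)}(\boldsymbol{\mu}_{1:K}) = \prod_{k=1}^{K} \mathcal{N}(\boldsymbol{\mu}_k | \tilde{\boldsymbol{\mu}}_k, \tilde{Q}_k^{-1}),$$

and adopting Eq. (12.47) to the current needs,

$$\mathbb{E}_{q_\mu^{(j+1)}}[\boldsymbol{\mu}_k \boldsymbol{\mu}_k^T] = \tilde{\Sigma}_k + \tilde{\boldsymbol{\mu}}_k \tilde{\boldsymbol{\mu}}_k^T = \tilde{Q}_k^{-1} + \tilde{\boldsymbol{\mu}}_k \tilde{\boldsymbol{\mu}}_k^T.$$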


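E-Step 1c:

$$\tilde{\nu}_k = \nu_0 + \sum_{n=1}^{N} \rho_{nk},$$

$$\tilde{W}_k^{-1} = W_0^{-1} + \sum_{n=1}^{N} \rho_{nk} \left( \boldsymbol{x}_n \boldsymbol{x}_n^T - \tilde{\boldsymbol{\mu}}_k \boldsymbol{x}_n^T - \boldsymbol{x}_n \tilde{\boldsymbol{\mu}}_k^T + \mathbb{E}_{q_\mu^{(j+1)}}[\boldsymbol{\mu}_k \boldsymbol{\mu}_k^T] \right),$$

$$q_Q^{(j+1)}(Q_{1:K}) = \prod_{k=1}^{K} \mathcal{W}(Q_k | \tilde{\nu}_k, \tilde{W}_k),$$

$$\mathbb{E}_{q_Q^{(j+1)}}[Q_k] = \tilde{\nu}_k \tilde{W}_k,$$

$$\mathbb{E}_{q_Q^{(j+1)}}[\ln|Q_k|] = \sum_{i=1}^{l} \psi\left( \frac{\tilde{\nu}_k + 1 - i}{2} \right) + l \ln 2 + \ln|\tilde{W}_k|,$$

where $\psi(\cdot)$ is the digamma function, defined as

$$\psi(a) := \frac{d \ln \Gamma(a)}{da},$$

and the gamma function has been defined in Eq. (2.91).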
M-Step 2: We have

$$P_k^{(j+1)} = \frac{1}{N} \sum_{n=1}^{N} \rho_{nk}.$$

The previous steps conclude the algorithm. Observe that the iterations retain the functional form of the PDFs that were adopted for the respective priors; this is a consequence of their exponential family origin. In [13], it is suggested that this procedure can also be used to determine the number of mixtures, instead of adopting a cross-validation technique, as pointed out in the remarks of Section 12.6. By adopting a large enough value for $K$, the probabilities $P_k$ associated with the irrelevant components will be driven to zero during the M-step. Note that such a modeling is possible in the Bayesian framework, because it automatically achieves a tradeoff between model complexity and data fitting. In [4], the probabilities $P_k$, $k = 1, 2, \ldots, K$, were considered as random variables and a Dirichlet prior was also imposed on them (Problem 13.7). However, such priors need to be selected with some care; otherwise they may affect the sparsification potential of the algorithm (e.g., [9]).
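
As a rough illustration of the normalization in E-Step 1a and of M-Step 2, the following Python sketch starts from hypothetical unnormalized log-weights $\ln \pi_{nk}$ (assumed to have been computed elsewhere from the current expectations), forms $\rho_{nk}$, and averages them to update $P_k$. The function names and the toy numbers are illustrative assumptions; the log-sum-exp shift is a standard numerical safeguard that is not part of the algorithm as stated above.

```python
import math

def responsibilities(log_pi):
    """Normalize each row of unnormalized log-weights ln(pi_nk) into rho_nk."""
    rho = []
    for row in log_pi:
        m = max(row)                          # shift to avoid underflow in exp
        w = [math.exp(v - m) for v in row]
        s = sum(w)
        rho.append([v / s for v in w])
    return rho

def update_mixing_probabilities(rho):
    """M-step: P_k = (1/N) * sum_n rho_nk."""
    n_obs = len(rho)
    n_comp = len(rho[0])
    return [sum(row[k] for row in rho) / n_obs for k in range(n_comp)]

# Toy input (made-up numbers): N = 3 observations, K = 2 components.
log_pi = [[-1.0, -3.0], [-2.0, -2.0], [-700.0, -705.0]]
rho = responsibilities(log_pi)
P = update_mixing_probabilities(rho)
```

Components whose $\rho_{nk}$ stay close to zero for all $n$ end up with $P_k$ close to zero, which is the mechanism behind the pruning of irrelevant mixtures mentioned above.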


Example 13.1. The purpose of this example is to demonstrate the power of the variational Bayesian 
method for mixture modeling compared to the more classical EM algorithm, which was discussed 
in Section 12.6. Five clusters of data were generated using a corresponding number of Gaussians, as 
shown in Fig. 13.4. The parameters used for each one of these Gaussians were 

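$$\boldsymbol{\mu}_1 = [-2.5, 2.5]^T, \quad \boldsymbol{\mu}_2 = [-4.0, -2.0]^T, \quad \boldsymbol{\mu}_3 = [2.0, -1.0]^T, \quad \boldsymbol{\mu}_4 = [0.1, 0.2]^T, \quad \boldsymbol{\mu}_5 = [3.0, 3.0]^T$$

and

$$\Sigma_1 = \begin{bmatrix} 0.5 & 0.081 \\ 0.081 & 0.7 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.4 & 0.02 \\ 0.002 & 0.3 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.6 & 0.531 \\ 0.531 & 0.9 \end{bmatrix},$$

$$\Sigma_4 = \begin{bmatrix} 0.5 & 0.22 \\ 0.22 & 0.8 \end{bmatrix}, \quad \Sigma_5 = \begin{bmatrix} 0.88 & 0.2 \\ 0.2 & 0.22 \end{bmatrix}.$$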

Prior to running the algorithms, we assumed that we do not know the exact number of mixtures, so a 
number of K = 25 clusters was used, that is, a much larger number than the true one. 

For the EM algorithm, the initial mean values were generated randomly, using a Gaussian $\mathcal{N}(\boldsymbol{\mu} | \boldsymbol{0}, I)$, and the respective initial covariance matrices, $\Sigma_k^{(0)}$, $k = 1, 2, \ldots, 25$, were generated with random elements, making sure that each one is positive definite. One way to achieve this is to generate a matrix $\Phi$ with random elements from $\mathcal{N}(0, 1)$ and then form $\Phi^T \Phi$. Another possibility is to start with a diagonal matrix, for example, the identity matrix $I$.

FIGURE 13.4

Figure for Example 13.1. (A) The initial (25) Gaussians for the EM algorithm. (B) The final clusters obtained after convergence by the EM algorithm. (C) The initial (25) Gaussians for the variational EM. (D) The final Gaussians obtained by the variational EM, after convergence. All the curves correspond to the 80% probability regions. Observe that the variational EM identifies the five clusters associated with the data; the rest of the mixtures correspond to zero probability weights.


For the variational EM algorithm, the following initial values were used: the mean values $\tilde{\boldsymbol{\mu}}_k^{(0)}$ and the initial covariance matrices $\tilde{\Sigma}_k^{(0)}$, $k = 1, 2, \ldots, 25$, were generated as before. Also, $\mathbb{E}^{(0)}[Q_k] = I$, $\mathbb{E}^{(0)}[\ln|Q_k|] = 1$. In both cases, the initial probabilities were set to be equal.

Observe that the variational EM identifies the five clusters associated with the data; the rest of the mixtures correspond to zero probability weights. In contrast, the EM algorithm tries to identify all 25 mixtures and the result is not satisfactory.
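
The random initialization of the covariance matrices used in this example, via the product $\Phi^T \Phi$, can be sketched as follows for the two-dimensional case. This is a minimal illustration under stated assumptions: the function name and the seed are arbitrary, and the construction yields a positive definite matrix only with probability one (the product is guaranteed to be symmetric and positive semidefinite; it is singular only if $\Phi$ happens to be singular).

```python
import random

def random_spd_2x2(rng):
    """Return Phi^T Phi for a 2x2 matrix Phi with i.i.d. N(0, 1) entries."""
    phi = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
    # (Phi^T Phi)[i][j] = sum over rows r of phi[r][i] * phi[r][j]
    return [[sum(phi[r][i] * phi[r][j] for r in range(2)) for j in range(2)]
            for i in range(2)]

rng = random.Random(0)          # fixed seed, only for reproducibility
S = random_spd_2x2(rng)
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
```

For a 2x2 symmetric matrix, a positive leading entry together with a positive determinant is equivalent to positive definiteness, which is what one would check before using `S` as an initial covariance.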


13.5 WHEN BAYESIAN INFERENCE MEETS SPARSITY 

The Bayesian approach to sparsity-aware learning will soon become our major concern. However, we will use this section as a warm-up. The close relationship between the use of a prior PDF and the regularization of a cost function has already been discussed in Section 12.2.2. There, the adoption of a Gaussian prior together with a Gaussian noise for the regression task led to the equivalence of MAP with the ridge regression. It takes little effort to show that the use of a Gaussian model for the noise together with a Laplacian prior for each one of the weights, that is,

$$p(\theta_k) = \frac{\lambda}{2} \exp\left( -\lambda |\theta_k| \right),$$

renders MAP equivalent to the $\ell_1$ norm regularization of the LS cost. For a Bayesian, however, who is not interested in cost functions, the secret that lies within the Laplacian prior is hidden in the heavy tails of this distribution. This is in contrast to a Gaussian PDF, which has very light tails. In other words, the probability that an observation of a Gaussian random variable takes values far from its mean decreases very fast. For example, the probabilities of observing values that deviate from the mean by more than $2\sigma$, $3\sigma$, $4\sigma$, and $5\sigma$ are 0.046, 0.003, $6 \times 10^{-5}$, and $6 \times 10^{-7}$, respectively. That is, if we provide a Gaussian prior, we basically inform the learning process to look for values “around” the mean; values away from the mean are heavily penalized. However, in sparsity-aware learning this would be the wrong information to pass over to our learning mechanism. Assuming the mean of the prior to be zero, although we expect most of the components of our parameters to be zero, still we want a few of them to be large. Hence, our prior information should be selected so as to assign small (but not too small) probabilities to large values. Thus, to a Bayesian, sparsity-aware learning becomes synonymous with imposing heavy-tail priors. Let us now turn back to our current task, and see how this brief introduction is related to our model. Our prior PDF, $p(\boldsymbol{\theta})$, according to the model of Eqs. (13.23) and (13.24), is obtained by marginalizing out the hyperparameters $\boldsymbol{\alpha}$ (Problem 13.10), that is,

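$$p(\boldsymbol{\theta}; a, b) = \int p(\boldsymbol{\theta} | \boldsymbol{\alpha})\, p(\boldsymbol{\alpha})\, d\boldsymbol{\alpha} = \int \prod_{k=0}^{K-1} \mathcal{N}(\theta_k | 0, \alpha_k^{-1})\, \mathrm{Gamma}(\alpha_k | a, b)\, d\boldsymbol{\alpha} = \prod_{k=0}^{K-1} \mathrm{st}\left( \theta_k \,\Big|\, 0, \frac{a}{b}, 2a \right), \tag{13.52}$$

where $\mathrm{st}(x | \mu, \lambda, \nu)$ is the Student’s t PDF, defined by

$$\mathrm{st}(x | \mu, \lambda, \nu) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\Gamma\left( \frac{\nu}{2} \right)} \left( \frac{\lambda}{\pi \nu} \right)^{1/2} \left( 1 + \frac{\lambda (x - \mu)^2}{\nu} \right)^{-\frac{\nu + 1}{2}}. \tag{13.53}$$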


The parameter $\nu$ is known as the number of degrees of freedom. Fig. 13.5 shows the graph of Student’s t PDFs for different values of $\nu$. For $\nu \to \infty$, the Student’s t distribution tends to a Gaussian of the same mean and precision $\lambda$. Observe the heavy-tail feature of Student’s t PDF, especially for low values of $\nu$. Recall that in our case, where we have used uninformative hyperpriors, the hyperparameter, $a$, was given a small value. Thus, our treatment in this section favors sparse solutions for the regression model. It will push as many of the coefficients $\theta_k$ as possible toward zero. That is, it prunes the less relevant basis functions $\phi_k(\boldsymbol{x})$ by setting the corresponding coefficients to zero. This is also the reason for using different hyperparameters $\alpha_k$ for each one of the parameters $\theta_k$, $k = 0, 1, \ldots, K-1$, which provide more freedom to the learning procedure to adjust each one of the parameters individually. In the earlier days, this approach was coined automatic relevance determination (ARD) [47,52,54]. An interesting discussion relating adaptive regularization and pruning is provided in [24].

FIGURE 13.5

Observe that for low values of the degrees of freedom, $\nu$, Student’s t PDF has very heavy tails. In contrast, the Gaussian PDF is a light-tailed PDF.

Fig. 13.6A provides a clear demonstration of the sparsity imposing properties of the Student’s t distribution. In the two-dimensional space, and as we move away from zero, probability mass is skewed toward the coordinate axes; that is, the PDF peaks around sparse solutions and sparsity is now enforced probabilistically. In contrast, the Gaussian does not give much chance to large values (see Fig. 13.6B).
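
The contrast between light and heavy tails can be checked numerically. The sketch below, which relies only on the Python standard library, evaluates the Gaussian tail probabilities quoted earlier in this section and compares the Student’s t PDF of Eq. (13.53) with a Gaussian of the same mean and precision. The function names and the chosen evaluation points are illustrative assumptions.

```python
import math

def gaussian_tail(k):
    """P(|x - mean| > k * sigma) for a Gaussian random variable."""
    return math.erfc(k / math.sqrt(2.0))

def student_t_pdf(x, mu, lam, nu):
    """Student's t PDF of Eq. (13.53), computed through log-gamma for stability."""
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm
                    - 0.5 * (nu + 1.0) * math.log1p(lam * (x - mu) ** 2 / nu))

def gaussian_pdf(x, mu, lam):
    """Gaussian PDF with mean mu and precision lam."""
    return math.sqrt(lam / (2.0 * math.pi)) * math.exp(-0.5 * lam * (x - mu) ** 2)

# Two-sided Gaussian tail probabilities beyond 2, 3, 4, and 5 standard deviations.
tails = [gaussian_tail(k) for k in (2, 3, 4, 5)]

# How much more density the Student's t (nu = 1) places at x = 5 than the Gaussian.
ratio_at_5 = student_t_pdf(5.0, 0.0, 1.0, 1.0) / gaussian_pdf(5.0, 0.0, 1.0)
```

For a very large number of degrees of freedom, `student_t_pdf` approaches `gaussian_pdf` for the same mean and precision, in line with the limiting behavior stated above.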


13.6 SPARSE BAYESIAN LEARNING (SBL) 

In Section 13.3, the prior for each one of the unknown parameters $\theta_k$, $k = 0, 1, \ldots, K-1$, was given the liberty to have its own variance, $\sigma_k^2 := 1/\alpha_k$. In turn, these variances were treated as hidden random variables and a prior was assigned to each of them in terms of a number of hyperparameters.

In [85,92], the model was slightly modified. The concept of using different variances for the priors was retained, but the variances were treated as deterministic parameters and not as random ones. In this context, the task becomes a generalization of the one treated in Section 12.5, and it is built upon the following assumptions:

$$p(\boldsymbol{y} | \boldsymbol{\theta}; \beta) = \mathcal{N}(\boldsymbol{y} | \Phi \boldsymbol{\theta}, \beta^{-1} I), \tag{13.54}$$

$$p(\boldsymbol{\theta} | \boldsymbol{\alpha}) = \mathcal{N}(\boldsymbol{\theta} | \boldsymbol{0}, A^{-1}), \tag{13.55}$$

where

$$A := \mathrm{diag}\{\alpha_0, \ldots, \alpha_{K-1}\}. \tag{13.56}$$


2 A slightly different yet equivalent view, employing uniform priors and using the respective modes instead of marginalizing out the variances, is followed in [85].

FIGURE 13.6

(A) The Student’s t peaks sharply around zero and falls slowly along the axes; hence, sparse solutions are favored. (B) The Gaussian peaks around zero and decays very fast, along all directions.


The precision $\beta$ is also treated as an unknown deterministic parameter. Our goal is (a) to obtain estimates for $\beta$ and $\alpha_k$, $k = 0, 1, \ldots, K-1$, and (b) to compute the predictive distribution, $p(y | \boldsymbol{x}, \boldsymbol{y})$, where $\boldsymbol{y}$ is the vector of observations. To this end, one could adopt the EM algorithm and follow similar steps as in Section 12.5; the only difference is that there, a common variance was shared by all the involved prior PDFs. The method is usually referred to as sparse Bayesian learning (SBL) and complies with the ARD rationale discussed in Section 13.5.

3 Notational dependence on the input training data, $X$, has been suppressed.

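In this section, we will adopt a different path, exploiting the Gaussian nature of the involved PDFs. A Type II maximum likelihood method will be employed, which was introduced in Remarks 12.2. Type II likelihood is defined as the marginal one, after integrating out the parameters $\boldsymbol{\theta}$. Following the discussion in Section 12.2 and for our current needs, Eq. (12.15) is written as

$$p(\boldsymbol{y} | \boldsymbol{\alpha}, \beta) = \mathcal{N}(\boldsymbol{y} | \boldsymbol{0}, \beta^{-1} I + \Phi A^{-1} \Phi^T). \tag{13.57}$$

Also, for the sake of completeness, Eqs. (12.16), (12.17), and (12.10) take the form

$$p(\boldsymbol{\theta} | \boldsymbol{y}; \boldsymbol{\alpha}, \beta) = \mathcal{N}(\boldsymbol{\theta} | \boldsymbol{\mu}, \Sigma), \tag{13.58}$$

with

$$\boldsymbol{\mu} = \beta \Sigma \Phi^T \boldsymbol{y}, \quad \Sigma = (A + \beta \Phi^T \Phi)^{-1}. \tag{13.59}$$

The objective now becomes to maximize with respect to $\alpha_k$, $k = 0, \ldots, K-1$, and $\beta$ the cost function

$$L(\boldsymbol{\alpha}, \beta) := \ln p(\boldsymbol{y}; \boldsymbol{\alpha}, \beta) = -\frac{N}{2} \ln(2\pi) - \frac{1}{2} \ln\left| \beta^{-1} I + \Phi A^{-1} \Phi^T \right| - \frac{1}{2} \boldsymbol{y}^T \left( \beta^{-1} I + \Phi A^{-1} \Phi^T \right)^{-1} \boldsymbol{y}. \tag{13.60}$$

Maximizing the above cost cannot be carried out analytically, and the following iterative scheme is derived (Problem 13.11; the proof is a bit tedious):

$$\gamma_k = 1 - \alpha_k^{(\text{old})} \Sigma_{kk}^{(\text{old})}, \tag{13.61}$$

$$\alpha_k^{(\text{new})} = \frac{\gamma_k}{\left( \mu_k^{(\text{old})} \right)^2}, \quad k = 0, 1, \ldots, K-1, \tag{13.62}$$

$$\beta^{(\text{new})} = \frac{N - \sum_{k=0}^{K-1} \gamma_k}{\left\| \boldsymbol{y} - \Phi \boldsymbol{\mu}^{(\text{new})} \right\|^2}. \tag{13.63}$$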

The iterative scheme is initialized by an arbitrary set of values and it is repeated until a convergence criterion is met; $\Sigma_{kk}$ is the respective diagonal element of the matrix $\Sigma$. Note that both $\Sigma$ and $\boldsymbol{\mu}$ depend on the values of $\beta$ and $\alpha_k$. The main complexity per iteration step is due to the matrix inversion involved in the respective definition in (13.59), which amounts to $O(K^3)$ operations. Moreover, because a matrix inversion is involved, one must take care of near singularities, due to numerical errors. This can be the case in practice, because some of the values of $\alpha_k$ may become very large. Thus, care must be taken so that once such values occur, one removes the corresponding columns in $\Phi$ and sets the respective values of $\theta_k$ to zero. As a matter of fact, this is how sparsity is enforced by the method. Parameters with mean value equal to zero and a variance that becomes very small (precision very large) are set to zero. This behavior has empirically been observed in practice.
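
The updates of Eqs. (13.59) and (13.61)-(13.63), together with the pruning of columns whose $\alpha_k$ becomes very large, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `sbl`, the pruning threshold `prune_at`, the fixed number of iterations (used in place of a convergence test), and the synthetic data are all hypothetical choices. As a simplification, the noise precision update uses the posterior mean computed at the start of the current iteration.

```python
import numpy as np

def sbl(Phi, y, n_iter=100, prune_at=1e6):
    """Type II ML updates for SBL with pruning of columns whose alpha is large."""
    N, K = Phi.shape
    alpha = np.ones(K)                  # arbitrary initialization
    beta = 1.0
    active = np.arange(K)               # indices of the columns still in the model
    for _ in range(n_iter):
        P = Phi[:, active]
        a = alpha[active]
        Sigma = np.linalg.inv(np.diag(a) + beta * P.T @ P)    # Eq. (13.59)
        m = beta * Sigma @ P.T @ y                            # Eq. (13.59)
        gamma = 1.0 - a * np.diag(Sigma)                      # Eq. (13.61)
        a_new = gamma / np.maximum(m ** 2, 1e-300)            # Eq. (13.62)
        alpha[active] = a_new
        resid = y - P @ m
        beta = (N - gamma.sum()) / (resid @ resid)            # Eq. (13.63)
        active = active[a_new < prune_at]                     # drop pruned columns
    # Final posterior mean over the surviving columns; pruned weights stay at zero.
    P = Phi[:, active]
    Sigma = np.linalg.inv(np.diag(alpha[active]) + beta * P.T @ P)
    mu = np.zeros(K)
    mu[active] = beta * Sigma @ P.T @ y
    return mu, alpha, beta, active

# Synthetic check: 10 random regressors, only two of which carry signal.
rng = np.random.default_rng(0)
N, K = 200, 10
Phi = rng.standard_normal((N, K))
theta_true = np.zeros(K)
theta_true[[1, 6]] = [2.0, -3.0]
y = Phi @ theta_true + 0.1 * rng.standard_normal(N)
mu, alpha, beta, active = sbl(Phi, y)
```

In this toy setting the two informative columns remain in `active`, and `beta` should end up near the inverse noise variance (about 100 here). How many of the irrelevant columns are pruned depends on the noise realization and on the threshold.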

The alternative path to deal with the method is via the EM algorithm. This leads to an equivalent set of recursions [85], but practical experience has shown that the previously given set of updates converges faster.

Extensions of the SBL framework in the context of block sparsity and for the case of multiple mea- 
surement vectors (MMVs), when elements in each nonzero row of the solution matrix are temporally 
correlated, are reported in [96,97]. Moreover, in the latter one, a theoretical analysis is provided, which 
shows that the SBL cost function has the very desirable property that its global minimum coincides 
with the sparsest solution to the MMV problem. 

Example 13.2. The goal of this example is to demonstrate the comparative performance, via a sim- 
ulation example, of (a) the variational Bayesian method, (b) the maximum likelihood/LS (12.6), and 
(c) the EM algorithm of Section 12.5 in the context of linear regression and in particular in the sparse 
modeling framework. The SBL method gave results very similar to the variational approach, and it 
is not discussed any further. To this end, we generated the training data according to the following 
scenario. 

The interval $[-10, 10]$ on the real axis was sampled at $N = 100$ equidistant points $x_n$, $n = 1, 2, \ldots, 100$. The training data comprise the pairs $(y_n, x_n)$, $n = 1, 2, \ldots, N$, where

$$y_n = \exp\left( -\frac{1}{2} \frac{(x_n + 5.8)^2}{0.1} \right) + \exp\left( -\frac{1}{2} \frac{(x_n - 2.6)^2}{0.1} \right) + \eta_n,$$

where $\eta_n$ are i.i.d. zero mean Gaussian noise samples of variance $\sigma_\eta^2 = 0.015$. To fit the data, the following model was adopted:

$$y = \sum_{k=1}^{N} \theta_k \exp\left( -\frac{1}{2} \frac{(x - x_k)^2}{0.1} \right).$$

Thus, the matrix $\Phi$ has the following elements:

$$[\Phi]_{nk} = \exp\left( -\frac{1}{2} \frac{(x_n - x_k)^2}{0.1} \right), \quad n = 1, 2, \ldots, N, \quad k = 1, 2, \ldots, N.$$


Note that we have as many parameters as the number of training points. This is in line with the relevance vector machine rationale, which will be discussed in Section 13.7. Fig. 13.7 illustrates the results. The red full-line curve corresponds to the true function that generates the data. The gray full curve corresponds to the model, having plugged in as estimated values $\hat{\theta}_k$ the respective posterior mean values from Eq. (13.34). The dotted red curve corresponds to the ML solution and the dotted gray curve to the EM, where the estimates correspond to the mean of the respective posterior (Eq. (12.44)). The performance advantage of the variational approach is obvious; the resulting curve almost coincides with the true one. Observe how the variational Bayesian approach managed to cope with the overfitting and pushed most of the parameters to zero values.

FIGURE 13.7

The figure corresponds to the setup of Example 13.2. Observe that the fitting curve obtained via the variational method is almost identical to the true one.


13.6.1 THE SPIKE AND SLAB METHOD 

This is an old technique for imposing sparsity [42,50]. Let us consider our familiar regression model,

$$y = \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}) + \eta = \sum_{k=0}^{K-1} \theta_k \phi_k(\boldsymbol{x}) + \eta. \tag{13.64}$$

A new set of auxiliary binary indicator variables is introduced, $s_k \in \{0, 1\}$, $k = 0, 1, \ldots, K-1$. Let also the prior imposed on $\boldsymbol{\theta}$ be a Gaussian, $p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} | \boldsymbol{0}, \sigma^2 I)$. As the name suggests, the indicator variables control the presence or absence of a parameter in the summation in Eq. (13.64). For example, if $s_k = 1$, then the corresponding parameter $\theta_k$ is present, and if $s_k = 0$, then $\theta_k$ is removed; this is the way sparsity is imposed onto the model. To this end, a joint Bernoulli prior distribution (Chapter 2) is adopted for the indicator variables, i.e.,

$$P(\boldsymbol{s}) = \prod_{k=0}^{K-1} p^{s_k} (1 - p)^{1 - s_k}, \tag{13.65}$$

where the parameter $0 < p < 1$ specifies a prior level of sparsity. This turns out to be equivalent to adopting the following prior on the parameters:

$$p(\boldsymbol{\theta}) = \prod_{k=0}^{K-1} \left( s_k \mathcal{N}(\theta_k | 0, \sigma^2) + (1 - s_k) \delta(\theta_k) \right): \text{ spike and slab prior}. \tag{13.66}$$


The latter is known as the spike and slab prior. The name comes from the fact that if $s_k = 0$, then a “spike” is imposed at zero, whereas the value $s_k = 1$ imposes a “slab,” because the Gaussian is a broad one (for large enough $\sigma^2$). The corresponding posterior is not Gaussian and its computation can be done by mobilizing approximate inference techniques, such as variational or Monte Carlo (see, e.g., [29] and the references therein).

Variants of the basic spike and slab scheme also exist (see, e.g., [78]). In the latter reference, it is shown that one can obtain the classical $\ell_0$-based sparsity enforcing constraint on the LS criterion (Chapter 9) as a limiting case of one of these variants. Such a path provides another connection between probabilistic and optimization-based techniques for sparsity. Another connection will be discussed in Section 13.9.
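
To get a feeling for the kind of parameter vectors the spike and slab prior favors, one can draw samples from it: first the indicators $s_k$ from the Bernoulli prior of Eq. (13.65), and then each $\theta_k$ from the slab if $s_k = 1$ or set exactly to zero (the spike) otherwise. The sketch below uses only the Python standard library; the function name and the values of $K$, $p$, and $\sigma$ are illustrative assumptions.

```python
import random

def sample_spike_and_slab(K, p, sigma, rng):
    """Draw (s, theta): s_k ~ Bernoulli(p); theta_k ~ N(0, sigma^2) if s_k = 1, else 0."""
    s = [1 if rng.random() < p else 0 for _ in range(K)]
    theta = [rng.gauss(0.0, sigma) if sk == 1 else 0.0 for sk in s]
    return s, theta

rng = random.Random(1)                       # fixed seed, only for reproducibility
s, theta = sample_spike_and_slab(K=1000, p=0.1, sigma=2.0, rng=rng)
frac_nonzero = sum(s) / len(s)               # should be close to p
```

A small value of `p` produces vectors with few nonzero entries, which can nevertheless be large when `sigma` is large.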


13.7 THE RELEVANCE VECTOR MACHINE FRAMEWORK 


An important aspect of the work in [85] was the introduction of relevance vector machines for regression as well as for classification. Inspired by the support vector regression (SVR), which was discussed in Chapter 11, a specific regression model was considered, that is,

$$y = \theta_0 + \sum_{k=1}^{N} \theta_k \kappa(\boldsymbol{x}, \boldsymbol{x}_k) + \eta. \tag{13.67}$$

In other words, the general regression model of Eq. (12.1) is considered for $K = N + 1$, where $N$ is the number of observations and

$$\phi_k(\boldsymbol{x}) = \kappa(\boldsymbol{x}, \boldsymbol{x}_k),$$

where $\kappa(\cdot, \cdot)$ is a kernel function, as defined in Chapter 11, centered at the input observation points, $\boldsymbol{x}_k$, $k = 1, 2, \ldots, N$. Thus, the number of parameters becomes equal (plus one) to the number of training points.

The task can be treated either via the SBL philosophy or via the employment of the variational ap- 
proximation rationale, in order to impose sparsity. In [85], it is pointed out that the variational Bayesian 
approach is computationally more intensive and in practice it results in mean values for the hyperpa- 
rameters, which are identical to the values obtained by using the SBL approach. 

Inspired by the definition of the support vectors in the SVR, the surviving data points that contribute to Eq. (13.67) are called relevance vectors. Also, the kernels to be used in the RVM framework need not be symmetric positive definite functions, because the modeling is not necessarily associated with a reproducing kernel Hilbert space (RKHS).


13.7.1 ADOPTING THE LOGISTIC REGRESSION MODEL FOR CLASSIFICATION 


Besides the relevance vector regression, the relevance vector classification was also introduced in [85]. 
Recall that in the support vector machine (SVM) classification, a linear (in an RKHS) classifier was 
designed. The same model is also adopted for the RVM. Given the value of a feature vector, $\boldsymbol{x}$, classification is performed according to the sign of the discriminant function, namely,

$$f(\boldsymbol{x}) := \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}) := \theta_0 + \sum_{k=1}^{N} \theta_k \phi_k(\boldsymbol{x}).$$

FIGURE 13.8

The logistic sigmoid function.


The goal is to obtain an estimate of the parameters $\boldsymbol{\theta}$ in the Bayesian framework; thus, somehow, we have to “embed” $\boldsymbol{\theta}$ into a PDF that relates the input-output data. In this vein, a well-known and widely used technique is the logistic regression model, which was introduced in Section 7.6.

According to this model and for a two-class ($\omega_1$, $\omega_2$) classification task, the posterior probabilities, as required by the Bayesian classifier, are modeled as

$$P(\omega_1 | \boldsymbol{x}) = \frac{1}{1 + \exp\left( -\boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}) \right)}: \text{ logistic regression model}, \tag{13.68}$$

and

$$P(\omega_2 | \boldsymbol{x}) = 1 - P(\omega_1 | \boldsymbol{x}). \tag{13.69}$$

There is more than one reason that justifies such a choice (see, e.g., [46] and Problem 13.12). Multiclass generalizations are also possible (e.g., [84] and Chapter 7).

For the sake of the less familiar reader, let us look at Eq. (13.68) more closely. The graph of the function

$$\sigma(t) = \frac{1}{1 + \exp(-t)}, \tag{13.70}$$


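known as the logistic sigmoid function, is shown in Fig. 13.8. For $t > 0$ ($\boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}) > 0$), $P(\omega_1 | \boldsymbol{x}) > \frac{1}{2}$ and the decision is in favor of $\omega_1$. The opposite holds true for $t < 0$ ($\boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}) < 0$). Considering the training set $(y_n, \boldsymbol{x}_n)$, $\boldsymbol{x}_n \in \mathbb{R}^l$ and $y_n \in \{0, 1\}$, and adopting a Bernoulli distribution for $P(y | \boldsymbol{x})$, the respective likelihood function can be defined as

$$P(\boldsymbol{y} | \boldsymbol{\theta}) = \prod_{n=1}^{N} \left( \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right) \right)^{y_n} \left( 1 - \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right) \right)^{1 - y_n}, \tag{13.71}$$

which is the counterpart of Eq. (13.22) for the regression case. We also adopt a Gaussian prior for $\boldsymbol{\theta}$, as in Eqs. (13.55) and (13.56). As in the SBL approach, our goal is to maximize the Type II log-likelihood with respect to the unknown parameters, $\boldsymbol{\alpha}$. However, $P(\boldsymbol{y} | \boldsymbol{\theta})$ is no longer Gaussian, and marginalizing out $\boldsymbol{\theta}$ cannot be carried out analytically. In [85], the Laplacian approximation is employed and the following stepwise procedure is adopted: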

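1. Assuming $\boldsymbol{\alpha}$ to be currently available, maximize with respect to $\boldsymbol{\theta}$ the posterior, which by simple arguments is easily shown to be

$$p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{\alpha}) = \frac{P(\boldsymbol{y} | \boldsymbol{\theta})\, p(\boldsymbol{\theta} | \boldsymbol{\alpha})}{P(\boldsymbol{y} | \boldsymbol{\alpha})}, \tag{13.72}$$

or equivalently,

$$\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \ln\left( P(\boldsymbol{y} | \boldsymbol{\theta})\, p(\boldsymbol{\theta} | \boldsymbol{\alpha}) \right) = \arg\max_{\boldsymbol{\theta}} \left( \sum_{n=1}^{N} \left[ y_n \ln \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right) + (1 - y_n) \ln\left( 1 - \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right) \right) \right] - \frac{1}{2} \boldsymbol{\theta}^T A \boldsymbol{\theta} + \text{constant} \right), \tag{13.73}$$

where $A := \mathrm{diag}\{\alpha_0, \alpha_1, \ldots, \alpha_N\}$. Maximizing Eq. (13.73) with respect to $\boldsymbol{\theta}$ results in (Problem 13.13)

$$\boldsymbol{\theta}_{\text{MAP}} = A^{-1} \Phi^T (\boldsymbol{y} - \boldsymbol{s}), \tag{13.74}$$

where $\boldsymbol{s} := [s_1, \ldots, s_N]^T$ and $s_n := \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right)$, $n = 1, 2, \ldots, N$.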

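2. Use $\boldsymbol{\theta}_{\text{MAP}}$ and the Laplace approximation method (Section 12.3) to approximate $p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{\alpha})$ by a Gaussian centered at $\boldsymbol{\theta}_{\text{MAP}}$ [45]. Recall from Section 12.3 that the covariance matrix of the approximate Gaussian is given by

$$\Sigma^{-1} = -\frac{\partial^2 \ln\left( P(\boldsymbol{y} | \boldsymbol{\theta})\, p(\boldsymbol{\theta} | \boldsymbol{\alpha}) \right)}{\partial \boldsymbol{\theta}^2} \Bigg|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{\text{MAP}}},$$

or (Problem 13.14)

$$\Sigma^{-1} = \left( \Phi^T T \Phi + A \right),$$

where $T = \mathrm{diag}\{t_1, t_2, \ldots, t_N\}$ and

$$t_n = \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right) \left( 1 - \sigma\left( \boldsymbol{\theta}^T \boldsymbol{\phi}(\boldsymbol{x}_n) \right) \right) \Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{\text{MAP}}}. \tag{13.75}$$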


3. Having obtained $\boldsymbol{\theta}_{\text{MAP}}$ and computed $\Sigma$, adapting Eq. (12.37) to our current notation we obtain

$$P(\boldsymbol{y} | \boldsymbol{\alpha}) \approx P(\boldsymbol{y} | \boldsymbol{\theta}_{\text{MAP}})\, p(\boldsymbol{\theta}_{\text{MAP}} | \boldsymbol{\alpha})\, (2\pi)^{\frac{N+1}{2}} |\Sigma|^{1/2}. \tag{13.76}$$

FIGURE 13.9

The decision curve that separates the two classes (red versus gray), which is obtained by the RVM classifier, corresponds to posterior probability values $P(\omega_1 | \boldsymbol{x}) = 0.5$. The Gaussian kernel was used with $\sigma^2 = 3$. Only six relevance vectors survive, namely the ones that have been circled.


Next, maximization of Eq. (13.76) with respect to $\boldsymbol{\alpha}$ provides the updated iteration estimate. Note that the first term of the product on the right-hand side is independent of $\boldsymbol{\alpha}$. Taking the logarithm and maximizing easily results in (Problem 13.15)

$$-\frac{1}{2} \theta_{\text{MAP},k}^2 + \frac{1}{2\alpha_k} - \frac{1}{2} \Sigma_{kk} = 0. \tag{13.77}$$

Because $\Sigma_{kk}$ and $\theta_{\text{MAP},k}$ depend on $\boldsymbol{\alpha}$, the equation is solved iteratively and results in exactly the same scheme as in Eq. (13.62), that is,

$$\alpha_k^{(\text{new})} = \frac{1 - \alpha_k^{(\text{old})} \Sigma_{kk}^{(\text{old})}}{\left( \theta_{\text{MAP},k}^{(\text{old})} \right)^2}.$$

The procedure continues until a convergence criterion is met [44,85].
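
The stationarity condition behind Eq. (13.74) can be checked numerically. Differentiating the bracketed term of Eq. (13.73) with respect to $\boldsymbol{\theta}$ gives $\Phi^T (\boldsymbol{y} - \boldsymbol{s}) - A \boldsymbol{\theta}$, and setting this to zero yields Eq. (13.74). The standard-library sketch below compares this analytic gradient with central finite differences on a made-up toy problem; the function names and all numerical values are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_posterior(theta, Phi, y, alpha):
    """Bracketed term of Eq. (13.73), without the additive constant."""
    total = 0.0
    for row, yn in zip(Phi, y):
        s = sigmoid(sum(p * t for p, t in zip(row, theta)))
        total += yn * math.log(s) + (1 - yn) * math.log(1.0 - s)
    return total - 0.5 * sum(a * t * t for a, t in zip(alpha, theta))

def gradient(theta, Phi, y, alpha):
    """Phi^T (y - s) - A theta; equating it to zero gives Eq. (13.74)."""
    s = [sigmoid(sum(p * t for p, t in zip(row, theta))) for row in Phi]
    g = [-a * t for a, t in zip(alpha, theta)]
    for row, yn, sn in zip(Phi, y, s):
        for k, p in enumerate(row):
            g[k] += p * (yn - sn)
    return g

# Toy problem: 4 observations, 2 basis functions (the first one is the bias term).
Phi = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
alpha = [0.5, 2.0]
theta = [0.1, -0.4]
g = gradient(theta, Phi, y, alpha)
```

Since Eq. (13.74) is implicit in $\boldsymbol{\theta}$ (the vector $\boldsymbol{s}$ depends on it), in practice $\boldsymbol{\theta}_{\text{MAP}}$ is computed iteratively, for instance with Newton-type steps that reuse the matrix of Eq. (13.75).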

As pointed out in [85], although in general the Laplacian local approximation to a Gaussian may not be a good one, in the case of the current classification task, due to the specific nature of the adopted models, the approximation is expected to provide good accuracy.

Fig. 13.9 shows the decision curve that results from the RVM method⁴ and classifies the points of the red/gray classes. The data set is the same as the one used in Example 11.4 of Chapter 11 when dealing with the SVM classifier. Six points, which have been circled, are the surviving relevance vectors. The Gaussian kernel was used with $\sigma^2 = 3$, which was found to give the best results. Observe that the number of surviving relevance vectors is significantly smaller than the number of support vectors obtained by the SVM of Chapter 11.


4 The software used was that from http://www.miketipping.com/sparsebayes.htm#software.

Remarks 13.2.

• Compared to SVM (SVR), the RVM machinery presents advantages and disadvantages. The SVM approach has the mathematically elegant property of, theoretically, giving a single minimum due to the convexity of the associated cost functions. This is not the case for the RVM framework, where the involved optimization steps refer to a nonconvex cost. It must be kept in mind that, when solving a nonconvex task, one may have to run the optimization algorithm a number of times, starting each time from different initial conditions, because the algorithm can be trapped in a local minimum.

Concerning complexity, the algorithmic steps for the RVM involve the inversion of the Hessian matrix, which amounts to $O(N^3)$ complexity. As discussed in Section 11.11, the complexity range of the efficient schemes for solving the SVM scales from linear to (approximately) quadratic. Also, the memory for the RVM exhibits an $O(N^2)$ dependence on the size of the training set, as opposed to a linear dependence in the SVM case. Besides complexity, inverting (big) matrices must be done with care in order to avoid numerical instabilities due to possible (near) singularity. Also, in general, RVMs need longer training times to converge, compared to SVMs, for similar error rates.

A fast RVM algorithm has been developed in [86] by analyzing the properties of the marginal likelihood. This enables a sequential addition and deletion of candidate basis functions (columns of $\Phi$) to monotonically maximize the marginal likelihood. This iterative algorithm operates in a constructive manner, until all relevant basis functions (for which the associated weights are nonzero) have been included. If $M$ denotes the number of relevant terms, the complexity amounts to $O(M^3)$, which for $M \ll N$ is more efficient than the original RVM.

The main advantage of RVMs is that, in general, they result in sparser solutions compared to SVMs for similar levels of generalization errors. This makes the prediction step, after the training has been completed, more efficient compared to the prediction model resulting from SVM. Moreover, SVMs suffer from their dependence on the user-defined hyperparameter, $C$ ($\epsilon$ for regression), whose value is generally found by cross-validation; this involves multiple training runs for different values.

• In [8], a different algorithmic approach has been adopted based on the variational bound approximation method, to be described next.


13.8 CONVEX DUALITY AND VARIATIONAL BOUNDS 

In the previous chapter, the Laplacian technique for the approximation of a general PDF by a Gaussian 
was introduced. The driving force behind such an approximation was to benefit from the computation- 
ally friendly nature of the Gaussian PDF. In this section, we will approach this task from a different 
perspective, involving maximization of a lower bound of the PDF at hand with respect to an extra pa- 
rameter, which is introduced into the problem and on which the lower bound depends. Our theoretical 
framework is that of convex duality, a well-known and powerful tool in convex analysis. 

Consider a function $f: \mathbb{R}^l \mapsto \mathbb{R}$. The function $f^*: \mathbb{R}^l \mapsto \mathbb{R}$, defined as

$$f^*(\boldsymbol{\xi}) = \max_{\boldsymbol{x}} \left\{ \boldsymbol{\xi}^T \boldsymbol{x} - f(\boldsymbol{x}) \right\}, \tag{13.78}$$

is called the conjugate of $f$. The domain of the conjugate function consists of all $\boldsymbol{\xi} \in \mathbb{R}^l$ for which the maximum is finite. A notable property of the conjugate function is its convexity; this is true whether or not $f$ is convex. The convexity is the outcome of the point-wise maximization of a family of affine functions (which are convex with respect to $\boldsymbol{\xi}$) [12].

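Carrying out the maximization in Eq. (13.78) with respect to $\boldsymbol{x}$ results in a value $\boldsymbol{x}_*$, such that

$$\boldsymbol{x}_*: \nabla f(\boldsymbol{x}_*) = \boldsymbol{\xi}, \tag{13.79}$$

which leads to the value

$$f^*(\boldsymbol{\xi}) = \boldsymbol{\xi}^T \boldsymbol{x}_* - f(\boldsymbol{x}_*). \tag{13.80}$$

Eqs. (13.79) and (13.80) provide the geometric interpretation of the conjugate function. The graph of the linear function $\boldsymbol{\xi}^T \boldsymbol{x}$ defines a hyperplane whose direction is controlled by $\boldsymbol{\xi}$; the latter is now equal to $\nabla f(\boldsymbol{x}_*)$, which defines the direction of the tangent hyperplane of the graph of $f(\boldsymbol{x})$ at $\boldsymbol{x}_*$. This tangent hyperplane is described by

$$g(\boldsymbol{x}) = f(\boldsymbol{x}_*) + (\boldsymbol{x} - \boldsymbol{x}_*)^T \nabla f(\boldsymbol{x}_*),$$

or using Eqs. (13.79) and (13.80),

$$g(\boldsymbol{x}) = \boldsymbol{\xi}^T \boldsymbol{x} - f^*(\boldsymbol{\xi}). \tag{13.81}$$

For $\boldsymbol{x} = \boldsymbol{0}$, Eq. (13.81) becomes $g(\boldsymbol{0}) = -f^*(\boldsymbol{\xi})$. This is illustrated in Fig. 13.10. Thus, $f^*(\boldsymbol{\xi})$ corresponds to the displacement that the graph of $\boldsymbol{\xi}^T \boldsymbol{x}$ has to undergo in order to “touch” that of $f(\boldsymbol{x})$.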



FIGURE 13.10 

The direction of the line $y = \xi x$ that crosses the origin is controlled by $\xi$ ($\xi = \tan\alpha$). The (negative) value of the conjugate function $f^*$ at $\xi$ defines the point where the line $y = g(x)$ cuts the vertical axis; $g(x)$ is formed by translating $\xi x$ until it becomes tangent to $f$ at $x_*$.


5 Strictly speaking, one should use the sup instead of max and inf instead of min, throughout.

A by-product of all this turns out to be very useful for us. It can be shown (Problem 13.16) that if $f$ is a convex function, then $(f^*)^* = f$ and in this case we can write

$$f(\boldsymbol{x}) = \max_{\boldsymbol{\xi}} \left\{ \boldsymbol{x}^T \boldsymbol{\xi} - f^*(\boldsymbol{\xi}) \right\}. \tag{13.82}$$

Thus, once $f^*$ is computed, a lower bound for $f$ becomes readily available, that is,

$$f(\boldsymbol{x}) \geq \boldsymbol{x}^T \boldsymbol{\xi} - f^*(\boldsymbol{\xi}), \tag{13.83}$$

where now $\boldsymbol{\xi}$ is interpreted as a parameter. To investigate this bound a bit further, we plug Eqs. (13.79) and (13.80) into Eq. (13.83) to obtain

$$f(\boldsymbol{x}) \geq f(\boldsymbol{x}_*) + (\boldsymbol{x} - \boldsymbol{x}_*)^T \nabla f(\boldsymbol{x}_*),$$

where the right-hand side is the linear function $g(\boldsymbol{x})$ describing the hyperplane tangent to $f(\boldsymbol{x})$ at $\boldsymbol{x}_*$. The bound becomes tight at $\boldsymbol{x} = \boldsymbol{x}_*$ (see Fig. 13.10). We will soon see how we can make this linear function bound a nonlinear one; it suffices to transform the argument of the function.
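
A small numerical check may help to make Eqs. (13.78) and (13.83) concrete. The sketch below takes the scalar convex function $f(x) = x^2$, an arbitrary choice made only for illustration, approximates its conjugate by a search over a grid, and verifies that the resulting linear function never exceeds $f$ and touches it at $x_* = \xi/2$. For this particular $f$, the conjugate is $f^*(\xi) = \xi^2/4$. The grid and the value of $\xi$ are also illustrative assumptions.

```python
def f(x):
    return x * x                    # convex test function (an arbitrary choice)

def conjugate(xi, grid):
    """Approximate f*(xi) = max_x { xi * x - f(x) } by a search over a grid."""
    return max(xi * x - f(x) for x in grid)

grid = [i / 1000.0 for i in range(-5000, 5001)]    # x in [-5, 5], step 0.001
xi = 1.5
f_star = conjugate(xi, grid)                       # analytic value: xi**2 / 4
lower_bound = [x * xi - f_star for x in grid]      # right-hand side of Eq. (13.83)
gap = [f(x) - b for x, b in zip(grid, lower_bound)]
```

The list `gap` is nonnegative everywhere on the grid and vanishes at `x = xi / 2`, the point of tangency.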

All that has been said for convex functions applies to concave ones if we replace the max operation in Eqs. (13.78) and (13.82) with min operations. Note that following this definition, the conjugate function is concave, being the result of point-wise minimization of a set of concave functions (an affine function can be considered either convex or concave). Furthermore, if the involved function is neither convex nor concave, one can search for invertible transformations that render it convex or concave.

In our context, the purpose of resorting to the notion of the conjugate function is our expectation 
that such a function, which bounds a (PDF) function as in Eq. (13.83), may lead to a functional form 
that lends itself to tractable computations of the involved integrations. 

Example 13.3. Compute the conjugate of the logarithmic function $f(x) = \ln x$, $x > 0$. 

The logarithmic function is known to be concave. Hence 

$$f^*(\xi) = \min_{x>0}\{\xi x - \ln x\},$$

or 

$$x_*: \quad \frac{1}{x} = \xi \;\Longrightarrow\; x_* = \frac{1}{\xi}.$$

Hence, 

$$f^*(\xi) = 1 + \ln\xi.$$

Therefore, 

$$\ln x = \min_{\xi>0}\{\xi x - 1 - \ln\xi\}.$$

Fig. 13.11 shows the respective graphs of the logarithmic function as well as the resulting linear 
function bound for different values of $\xi$. 
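
The result of Example 13.3 is easy to verify numerically. The sketch below (the function name `g` follows the caption of Fig. 13.11; the grids of test values are arbitrary assumptions) checks that every line $\xi x - 1 - \ln\xi$ lies on or above $\ln x$ and touches it at $x_* = 1/\xi$.

```python
import math

def g(x, xi):
    """Linear upper bound of ln x from Example 13.3: xi*x - 1 - ln(xi)."""
    return xi * x - 1.0 - math.log(xi)

xs = [0.1 * i for i in range(1, 101)]      # test points x in (0, 10]
xis = [0.2, 0.5, 1.0, 2.0, 4.0]            # a few values of the parameter xi

# Every line should lie on or above the logarithm ...
is_upper_bound = all(g(x, xi) >= math.log(x) - 1e-12 for x in xs for xi in xis)
# ... and should touch it at the tangency point x* = 1 / xi.
is_tight = all(abs(g(1.0 / xi, xi) - math.log(1.0 / xi)) < 1e-12 for xi in xis)
print(is_upper_bound, is_tight)
```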


6 Note that this is a necessary and sufficient condition for convexity. 

FIGURE 13.11 

The linear functions $g(x; \xi) = \xi x - 1 - \ln\xi$ provide upper bounds to $f(x) = \ln x$. Each one of the lines is tangent 
to $f(x) = \ln x$ at the point $x_* = 1/\xi$. 


Example 13.4. Consider the univariate Laplacian PDF, which we have already seen in Section 13.5, 

$$p(\theta) = \frac{\lambda}{2}\exp(-\lambda|\theta|), \quad \theta \in \mathbb{R}. \tag{13.84}$$

Our goal is to derive a lower bound in terms of its conjugate function. From Eq. (13.84), we get 

$$\ln p(\theta) = \ln\frac{\lambda}{2} - \lambda|\theta|.$$

Define 

$$f(x) := \ln\frac{\lambda}{2} - \lambda\sqrt{x}, \quad x \ge 0. \tag{13.85}$$

Then 

$$\ln p(\theta) = f(\theta^2). \tag{13.86}$$

Note that $f(x)$ is a convex function with respect to $x$ (Problem 13.16). The conjugate of $f(x)$ is 
obtained as 

$$f^*(\xi) = \max_x\left\{-\frac{\xi}{2}x - f(x)\right\}, \quad \xi > 0. \tag{13.87}$$

$\xi$ is constrained to positive values, because for $\xi < 0$ the maximum with respect to $x$ becomes infinite, 
which violates the definition of the conjugate functions. Recalling Eq. (13.85), maximization leads to 

$$x_*: \quad \lambda x^{-1/2} = \xi \;\Longrightarrow\; x_* = \lambda^2\xi^{-2}. \tag{13.88}$$


7 Because maximization takes place for all $\xi$, we use $-\frac{\xi}{2}$. This is only for notational convenience in order to obtain the result 
in a convenient form. 

FIGURE 13.12 

The Laplacian (red curves) and the approximating Gaussians for two different values of $\xi$. 


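Combining Eqs. (13.87) and (13.88) gives 

$$f^*(\xi) = \frac{\lambda^2}{2}\xi^{-1} - \ln\frac{\lambda}{2}. \tag{13.89}$$

Hence, we obtain the bound 

$$f(x) \ge -\frac{\xi}{2}x - \frac{\lambda^2}{2}\frac{1}{\xi} + \ln\frac{\lambda}{2},$$

or 

$$\ln p(\theta) \ge -\frac{\xi}{2}\theta^2 - \frac{\lambda^2}{2}\xi^{-1} + \ln\frac{\lambda}{2}, \quad \xi > 0.$$

Because this is true $\forall \xi > 0$, we can replace $\xi$ with $\xi^{-1}$, for notational convenience, which results in 

$$p(\theta) \ge \frac{\lambda}{2}\exp\left(-\frac{\xi^{-1}}{2}\theta^2\right)\exp\left(-\frac{\lambda^2}{2}\xi\right), \tag{13.90}$$

which, after mobilizing the Gaussian notation and its integration property, can be rewritten as 

$$p(\theta) \ge \mathcal{N}(\theta|0, \xi)\,\phi(\xi), \quad \xi > 0, \tag{13.91}$$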


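with 

$$\phi(\xi) := \frac{\lambda}{2}\sqrt{2\pi\xi}\,\exp\left(-\frac{\lambda^2}{2}\xi\right), \quad \xi > 0.$$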


This is very interesting indeed. The obtained lower bound has a functional dependence on $\theta$, which is of 
a Gaussian nature; the Gaussian term is centered at zero with variance $\xi$. Maximizing with respect to $\xi$, 
we will obtain the required approximation. Fig. 13.12 shows the obtained approximation for different 
values of $\xi$. Observe that introducing transformations in the involved variables can render the function 
in the obtained bound a nonlinear one. 

For a multivariate Laplacian and assuming a parameter vector with independent components, it can 
be trivially shown that 

$$p(\boldsymbol{\theta}) = \prod_{k=0}^{K-1} p(\theta_k) \ge \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \Xi)\prod_{k=0}^{K-1}\phi(\xi_k) := \hat{p}(\boldsymbol{\theta}; \boldsymbol{\xi}), \tag{13.92}$$

where $\boldsymbol{\xi} = [\xi_0, \ldots, \xi_{K-1}]^T$ and 

$$\Xi := \mathrm{diag}\{\xi_0, \xi_1, \ldots, \xi_{K-1}\}.$$
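
The scalar bound of Eqs. (13.90) and (13.91) can be checked numerically. In the sketch below, the function names and the test grids are illustrative assumptions; the check also uses the fact, which follows from maximizing the right-hand side of Eq. (13.90) with respect to $\xi$, that the bound touches the Laplacian at $\xi = |\theta|/\lambda$.

```python
import math

def laplacian(theta, lam):
    """Univariate Laplacian PDF of Eq. (13.84)."""
    return 0.5 * lam * math.exp(-lam * abs(theta))

def gauss(theta, var):
    """Zero-mean Gaussian PDF N(theta | 0, var)."""
    return math.exp(-0.5 * theta * theta / var) / math.sqrt(2.0 * math.pi * var)

def phi(xi, lam):
    """phi(xi) = (lam/2) sqrt(2 pi xi) exp(-lam^2 xi / 2), so that Eq. (13.91) equals Eq. (13.90)."""
    return 0.5 * lam * math.sqrt(2.0 * math.pi * xi) * math.exp(-0.5 * lam * lam * xi)

def gaussian_bound(theta, xi, lam):
    """Right-hand side of Eq. (13.91)."""
    return gauss(theta, xi) * phi(xi, lam)

lam = 2.0
thetas = [0.05 * i for i in range(-60, 61)]
xis = [0.1, 0.5, 1.0, 2.0]

# Lower-bound property for every theta and every variational parameter xi.
is_lower_bound = all(laplacian(t, lam) >= gaussian_bound(t, xi, lam) - 1e-15
                     for t in thetas for xi in xis)
# Tightness at xi = |theta| / lam (for theta != 0).
theta0 = 1.5
gap_at_optimum = laplacian(theta0, lam) - gaussian_bound(theta0, abs(theta0) / lam, lam)
print(is_lower_bound, gap_at_optimum)
```

For a fixed $\xi$ the bound is tight only at $\theta = \pm\lambda\xi$, which is consistent with the local nature of the approximation seen in Fig. 13.12.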

Remarks 13.3. 

• The method of representing a convex function via the optimization of the lower bound in terms of 
its conjugate (dual form) is known as the variational method, and the associated parameters $\xi_k$ as 
variational parameters [67]. Its use in the context of machine learning was first reported in [31] (see 
also [35]), and its use subsequently proliferated and was adopted in different scenarios. 

• The method has been used to obtain variational approximations for a number of PDFs that are 
suitable for sparsity-aware learning, for example, Jeffreys’, Student’s t, generalized Gaussians (see, e.g., 
[59]), and the logistic regression model [33] (Problem 13.18). Compared to the Laplacian 
approximation method, the variational approach provides the extra flexibility of optimizing with respect 
to the corresponding variational parameters (see [33] for a related discussion). The reader, 
however, has to keep in mind that both approximations need not always be good ones. This is readily 
observed from Fig. 13.12. One may obtain a good approximation locally, but not everywhere. 
However, it turns out that in practice, in the context of Bayesian learning, a poor approximation of the 
prior (for which such approximations are used) does not necessarily lead to a poor approximation of 
the posterior. There is no guarantee of it, and it is only the performance in practice that has the final 
verdict. This can be considered a drawback of the Bayesian technique compared to the deterministic 
methods based on optimization criteria. The latter ones, by adopting convex cost functions, can lead 
to solutions that are well characterized. In contrast, Bayesian inference techniques suffer from their 
nonconvex nature and the fact that very often the imposed approximations may not necessarily be 
good ones. However, this is always the case in life. There is no free lunch. At the time this book 
was written, both of these paths to machine learning still comprised viable and powerful techniques, 
with their pros and cons. 


13.9 SPARSITY-AWARE REGRESSION: A VARIATIONAL BOUND 
BAYESIAN PATH 

The goal of this section is to demonstrate the use of convex duality and the respective variational bounds 
in order to approximate the computation of the evidence function in cases where the corresponding 
integral is intractable. We have chosen to describe the method in the framework of sparsity-aware 
learning; this can also help us establish bridges with Chapter 9, and some of the results will be used in 
Section 13.9 for this purpose. 

To comply with the assumptions made in Chapter 9, and without loss of generality, let us assume 
that the involved data have zero mean values. If not, the training data can be centered by subtracting 
their respective sample means; thus, we set $\theta_0 = 0$ and assume that the number of parameters is $K$. 
Then our regression model becomes 

$$\boldsymbol{y} = \Phi\boldsymbol{\theta} + \boldsymbol{\eta}, \quad \boldsymbol{\eta}, \boldsymbol{y} \in \mathbb{R}^N,\ \boldsymbol{\theta} \in \mathbb{R}^K,\ K > N,$$

where we are informed about the “secret” that most of the components of $\boldsymbol{\theta}$ are (almost) zero. In 
Section 13.5, we commented on the inadequacy of a Gaussian prior to provide a reasonable statistical 
description of a sparse random vector. A heavy-tailed distribution that enjoys popularity for sparse 
modeling is the Laplacian one. After all, adopting a Laplacian prior for $\boldsymbol{\theta}$, that is, 

$$p(\boldsymbol{\theta}) = \prod_{k=1}^{K} p(\theta_k) = \prod_{k=1}^{K}\frac{\lambda}{2}\exp(-\lambda|\theta_k|),$$

and a Gaussian conditional PDF, $p(\boldsymbol{y}|\boldsymbol{\theta})$, for the observations $\boldsymbol{y}$ as in Eq. (13.22), makes the MAP 
estimation identical to our familiar LASSO task, discussed in Chapter 9. In this vein, we will build this 
section around the Laplacian PDF. Fig. 13.13A shows the Laplacian for different values of $\lambda$. In Fig. 13.13B 
the two-dimensional plot is provided, from which the preference of this PDF for sparse solutions 
is readily observed. 

FIGURE 13.13 

(A) The one-dimensional Laplacian for different values of $\lambda$. (B) A plot of a two-dimensional Laplacian PDF. 

The problem with the Laplacian PDF is that its presence in Eq. (12.14) makes the computation of 
the integral intractable. Also, recall from Section 12.8 that the Laplacian PDF does not 
belong to the computationally attractive exponential family. To facilitate its treatment, we will employ 
the variational bound approximation method in order to approximate the Laplacian by a Gaussian, 
following Eqs. (13.90) and (13.92). The variational parameters will be determined by maximizing 
the respective evidence via the EM algorithm. This method for recovering sparse solutions was first 
introduced in [21] in the context of dictionary learning. 

For simplicity, let the noise sequence in our regression model be white with variance $\sigma_\eta^2 := \beta^{-1}$. Then 

$$p(\boldsymbol{y}|\boldsymbol{\theta}; \beta) = \mathcal{N}(\boldsymbol{y}|\Phi\boldsymbol{\theta}, \beta^{-1}I),$$

and using Eq. (13.92), we can write 

$$\begin{aligned} p(\boldsymbol{y}; \beta) &= \int \mathcal{N}(\boldsymbol{y}|\Phi\boldsymbol{\theta}, \beta^{-1}I)\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\ &\ge \left[\int \mathcal{N}(\boldsymbol{y}|\Phi\boldsymbol{\theta}, \beta^{-1}I)\,\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \Xi)\,d\boldsymbol{\theta}\right]\prod_{k=1}^{K}\phi(\xi_k). \end{aligned}$$

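We know by now that the integral results in a new Gaussian function (recall Eq. (12.15)), and hence 

$$p(\boldsymbol{y}; \beta) \ge \mathcal{N}(\boldsymbol{y}|\boldsymbol{0}, \beta^{-1}I + \Phi\Xi\Phi^T)\prod_{k=1}^{K}\phi(\xi_k) := \hat{p}(\boldsymbol{y}; \beta, \boldsymbol{\xi}). \tag{13.93}$$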
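
The evidence bound of Eq. (13.93) can be illustrated in the scalar case $K = N = 1$, where the exact evidence is a one-dimensional integral that can be approximated on a grid. This is only a sketch under stated assumptions: the scalar regressor `c` (playing the role of $\Phi$), the parameter values, the integration grid, and all function names are hypothetical choices made for illustration.

```python
import math

def gauss(x, mean, var):
    """Gaussian PDF N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def phi(xi, lam):
    """phi(xi) = (lam/2) sqrt(2 pi xi) exp(-lam^2 xi / 2), as in Eq. (13.91)."""
    return 0.5 * lam * math.sqrt(2.0 * math.pi * xi) * math.exp(-0.5 * lam * lam * xi)

def evidence(y, c, beta, lam, half_width=15.0, step=0.005):
    """Riemann-sum approximation of the integral of N(y | c*theta, 1/beta) * Laplacian(theta)."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        theta = -half_width + i * step
        total += gauss(y, c * theta, 1.0 / beta) * 0.5 * lam * math.exp(-lam * abs(theta))
    return total * step

def evidence_bound(y, c, beta, lam, xi):
    """Scalar version of Eq. (13.93): N(y | 0, 1/beta + c^2 xi) * phi(xi)."""
    return gauss(y, 0.0, 1.0 / beta + c * c * xi) * phi(xi, lam)

y, c, beta, lam = 1.0, 1.0, 4.0, 2.0          # assumed toy values
exact = evidence(y, c, beta, lam)
bounds = {xi: evidence_bound(y, c, beta, lam, xi) for xi in (0.05, 0.2, 0.4, 1.0, 3.0)}
for xi, b in bounds.items():
    print(f"xi = {xi:4.2f}: bound = {b:.4f}  (evidence = {exact:.4f})")
```

Every value of $\xi$ gives a valid lower bound; the EM iterations described next can be viewed as a search for the value that makes the bound as large as possible.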


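The unknown values of $\beta$ and $\Xi$ could be obtained by direct maximization of the previous bound. 
However, here we will adopt an EM algorithm approach, in a similar way as in Section 12.5; the 
difference here lies in the existence of the multiplicative terms $\phi(\xi_k)$, which differentiate the M-step. In 
order to employ the EM, we need to know the posterior $p(\boldsymbol{\theta}|\boldsymbol{y}; \beta)$. We will accept the following, 

$$p(\boldsymbol{\theta}|\boldsymbol{y}; \beta) \approx \hat{p}(\boldsymbol{\theta}|\boldsymbol{y}; \beta, \Xi) := \frac{\mathcal{N}(\boldsymbol{y}|\Phi\boldsymbol{\theta}, \beta^{-1}I)\,\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \Xi)}{\int \mathcal{N}(\boldsymbol{y}|\Phi\boldsymbol{\theta}, \beta^{-1}I)\,\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \Xi)\,d\boldsymbol{\theta}}, \tag{13.94}$$

where in place of $p(\boldsymbol{\theta})$ we have used its respective bound, $\hat{p}(\boldsymbol{\theta}; \boldsymbol{\xi})$, from Eq. (13.92). Note that 
irrespective of which method one adopts to optimize with respect to the unknown parameters, the approximate 
posterior given in Eq. (13.94) is the quantity of interest in regression; this is used either to predict $\boldsymbol{\theta}$ or 
to perform predictions of the output value (Eq. (12.18)). 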

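It must be stressed, however, that Eq. (13.94) is not a bound of $p(\boldsymbol{\theta}|\boldsymbol{y}; \beta)$ anymore, because 
normalization has taken place and division does not necessarily respect bounds. Recalling what we said in 
Section 12.2.2 (Eqs. (12.27) and (12.28)) we obtain 

$$\hat{p}(\boldsymbol{\theta}|\boldsymbol{y}; \beta, \Xi) = \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{\mu}_{\theta|y}, \Sigma_{\theta|y}), \tag{13.95}$$

where 

$$\boldsymbol{\mu}_{\theta|y} = \Xi\Phi^T\left(\beta^{-1}I + \Phi\Xi\Phi^T\right)^{-1}\boldsymbol{y}, \tag{13.96}$$

$$\Sigma_{\theta|y} = \Xi - \Xi\Phi^T\left(\beta^{-1}I + \Phi\Xi\Phi^T\right)^{-1}\Phi\Xi. \tag{13.97}$$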

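We are now ready to give the algorithmic steps. Recall that in EM, the goal is to maximize the expected 
value of the complete log-likelihood with respect to the unknown set of deterministic parameters. In 
our case, our goal will be to maximize the corresponding bound with respect to $\beta$ and $\boldsymbol{\xi}$, 

$$\mathbb{E}\left[\ln p(\boldsymbol{y}, \boldsymbol{\theta}; \beta)\right] = \mathbb{E}\left[\ln\left(p(\boldsymbol{y}|\boldsymbol{\theta}; \beta)\,p(\boldsymbol{\theta})\right)\right] \ge \mathbb{E}\left[\ln\left(p(\boldsymbol{y}|\boldsymbol{\theta}; \beta)\,\hat{p}(\boldsymbol{\theta}; \boldsymbol{\xi})\right)\right].$$

Assuming $\boldsymbol{\xi}^{(j)}$, $\beta^{(j)}$ are known, the $(j+1)$th iteration comprises the following computations: 

• E-step: From Eq. (13.96) and Eq. (13.97) compute $\boldsymbol{\mu}_{\theta|y}^{(j)}$ and $\Sigma_{\theta|y}^{(j)}$. 
Following similar steps as in Section 12.5, we readily obtain 

$$\begin{aligned} Q(\boldsymbol{\xi}, \beta; \boldsymbol{\xi}^{(j)}, \beta^{(j)}) = {} & \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\mathbb{E}_{\theta|y}\left[\|\boldsymbol{y} - \Phi\boldsymbol{\theta}\|^2\right] \\ & + \sum_{k=1}^{K}\ln\phi(\xi_k) - \frac{K}{2}\ln(2\pi) - \frac{1}{2}\ln|\Xi| - \frac{1}{2}\sum_{k=1}^{K}\frac{\mathbb{E}_{\theta|y}[\theta_k^2]}{\xi_k}, \end{aligned} \tag{13.98}$$

where (recall Eq. (12.49)) 

$$\mathbb{E}_{\theta|y}\left[\|\boldsymbol{y} - \Phi\boldsymbol{\theta}\|^2\right] = \|\boldsymbol{y} - \Phi\boldsymbol{\mu}_{\theta|y}^{(j)}\|^2 + \mathrm{trace}\left\{\Phi\Sigma_{\theta|y}^{(j)}\Phi^T\right\}$$

and (recall Eq. (12.47)) 

$$\mathbb{E}_{\theta|y}[\theta_k^2] = \left[\boldsymbol{\mu}_{\theta|y}^{(j)}\boldsymbol{\mu}_{\theta|y}^{(j)T} + \Sigma_{\theta|y}^{(j)}\right]_{kk}.$$

• M-step: Taking the derivative of $Q(\boldsymbol{\xi}, \beta; \boldsymbol{\xi}^{(j)}, \beta^{(j)})$ with respect to $\beta$ and equating to 0 we get 

$$\beta^{(j+1)} = \frac{N}{\|\boldsymbol{y} - \Phi\boldsymbol{\mu}_{\theta|y}^{(j)}\|^2 + \mathrm{trace}\left\{\Phi\Sigma_{\theta|y}^{(j)}\Phi^T\right\}}. \tag{13.99}$$

The derivation with respect to $\xi_k$, $k = 1, 2, \ldots, K$, results in (Problem 13.19) 

$$\xi_k^{(j+1)} = \sqrt{\frac{\mathbb{E}_{\theta|y}[\theta_k^2]}{\lambda^2}}, \tag{13.100}$$

which completes the loop. Iterations continue until a termination criterion is met. 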
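
The E- and M-steps above can be sketched in a few lines of code. What follows is an illustrative implementation under stated assumptions, not a reference one: it uses plain Python lists and a small Gauss-Jordan inverse (the helper names are hypothetical), a fixed number of iterations instead of a termination criterion, an assumed initialization $\beta = 1$, $\xi_k = 1$, and a toy problem with $K < N$ and mutually orthogonal regressors (unlike the $K > N$ setting of the text), chosen so that the outcome is easy to verify by hand.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def em_laplacian_bound(Phi, y, lam, n_iter=100):
    """EM on the variational bound; returns (mu, Sigma, beta, xi)."""
    N, K = len(Phi), len(Phi[0])
    beta, xi = 1.0, [1.0] * K                                   # assumed initialization
    for _ in range(n_iter):
        # E-step, Eqs. (13.96)-(13.97)
        XiPhiT = [[xi[k] * Phi[n][k] for n in range(N)] for k in range(K)]   # Xi Phi^T (K x N)
        C = matmul(Phi, XiPhiT)                                              # Phi Xi Phi^T (N x N)
        for n in range(N):
            C[n][n] += 1.0 / beta
        G = matmul(XiPhiT, inverse(C))                                       # Xi Phi^T C^{-1}
        mu = [sum(G[k][n] * y[n] for n in range(N)) for k in range(K)]
        GPX = matmul(G, transpose(XiPhiT))                                   # G Phi Xi (K x K)
        Sigma = [[(xi[k] if k == l else 0.0) - GPX[k][l] for l in range(K)] for k in range(K)]
        # M-step, Eqs. (13.99)-(13.100)
        resid = [y[n] - sum(Phi[n][k] * mu[k] for k in range(K)) for n in range(N)]
        PS = matmul(Phi, Sigma)
        trace = sum(PS[n][k] * Phi[n][k] for n in range(N) for k in range(K))
        beta = N / (sum(r * r for r in resid) + trace)
        xi = [math.sqrt(max(mu[k] ** 2 + Sigma[k][k], 0.0)) / lam for k in range(K)]
    return mu, Sigma, beta, xi

# Toy data: y = 2 * (first column) + a small perturbation orthogonal to all columns.
Phi = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1],
       [1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [2.1, 1.95, 1.9, 2.05, 1.9, 2.05, 2.1, 1.95]
mu, Sigma, beta, xi = em_laplacian_bound(Phi, y, lam=1.0)
print("posterior mean:", [round(m, 4) for m in mu])
print("beta:", round(beta, 2), "xi:", [round(v, 4) for v in xi])
```

In this toy setting only the first component of the posterior mean is significantly different from zero, and the variational parameters of the other two components shrink accordingly. For larger problems a numerical linear algebra library would replace the list-based helpers.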

An alternative viewpoint that justifies the maximization of the bound of the evidence with respect 
to the variational parameters is the following (see also [95] for a related discussion). At each iteration 
step, the EM algorithm maximizes $\mathbb{E}[p(\boldsymbol{y}|\boldsymbol{\theta}; \beta)\,\hat{p}(\boldsymbol{\theta}; \boldsymbol{\xi})]$ due to the monotonicity of the logarithmic 
function. Equivalently, this can be seen as the following minimization task: 


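$$\min_{\boldsymbol{\xi}}\ \mathbb{E}\left[p(\boldsymbol{y}|\boldsymbol{\theta}; \beta)\,\left|p(\boldsymbol{\theta}) - \hat{p}(\boldsymbol{\theta}; \boldsymbol{\xi})\right|\right], \tag{13.101}$$

where the lower bound property (Eq. (13.92)) has been used in order to involve the absolute value. 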


Looking at Eq. (13.101), one may think of a reason that justifies what in practice is commonly observed; 
that is, the method results in good performance although the overall approximation of the prior may not 
be a good one. The important issue is to have a good approximation in values of $\boldsymbol{\theta}$ that correspond to 
relatively large values of $p(\boldsymbol{y}|\boldsymbol{\theta})$. The approximation in ranges of $\boldsymbol{\theta}$ where $p(\boldsymbol{y}|\boldsymbol{\theta}) \approx 0$ does not affect 
the main goal of the task. Moreover, Eq. (13.101) could also provide a justification of the relative 
advantage of the variational approximation method compared to the Laplacian method; in the latter, 
there is no room left to leverage any extra parameters in order to improve the final goal. 

Remarks 13.4. 

• As we have already commented in Chapter 9, sparsity-aware learning has been a field of intense 
research. Undoubtedly this is also the case for the Bayesian approach to sparsity promoting models. 
So far, we presented a hierarchical approach in Section 13.5, where sparsity was indirectly imposed 
by associating a gamma PDF prior on each one of the precision variables individually; this led 
to an equivalent high-tail Student’s t PDF description of the involved parameters $\boldsymbol{\theta}$. In the current 
section, a Laplacian prior was imposed on $\boldsymbol{\theta}$ in order to promote sparseness. These are not the only 
possibilities. We focused on them in order to demonstrate two possible paths to treat the evidence 
maximization whenever the resulting integral is computationally “awkward.” 

• In [18], sparsity in the Bayesian framework is attacked by imposing a Gaussian prior on the 
parameters, treating the variances as latent variables with an exponential prior. Such modeling is equivalent 
to a Laplacian PDF, once the variances are integrated out. The EM procedure is then used to compute 
the required estimates. Also in this paper, the use of Jeffreys’ prior ($p(x) \propto \frac{1}{x}$) is proposed as an 
alternative to the Laplacian one. 

• In [5], sparsity is imposed in a similar way as before, but in the hierarchical model, the parameter 
controlling the exponential prior of the precisions is also treated as a latent variable with a Jeffreys’ 
prior. In [6], sparsity on the unknown parameters was imposed via a generalized Gaussian PDF, that 
is, 



(13.102) 


Combining this prior with a Gaussian PDF for the conditional, $p(\boldsymbol{y}|\boldsymbol{\theta})$ in Eq. (13.22), would result in 
a MAP that corresponds to the LS regularized by a nonconvex $\ell_p$, $p < 1$, norm; we know that such 
norms are more aggressive, compared to the $\ell_1$ norm, in recovering sparse solutions. For $p = 1$, 
Eq. (13.102) becomes the Laplacian prior. In [6], gamma priors are used in association with the 
hyperparameter $\alpha$ and the noise variance, and the variational Bayesian approach is used to obtain 
the solution. 

• In [34], the RVM framework is exploited to obtain sparse solutions and the information related to 
the variance of the obtained estimates is used to determine the number of measurements, which is 
sufficient for recovering the solution in the framework of compressed sensing. 

SPARSITY-AWARE LEARNING: SOME CONCLUDING REMARKS 


The task of sparsity-aware learning has been treated in a number of parts in this book. In Chapter 9, 
it was treated as an optimization problem of a regularized cost function. In Section 13.3 the automatic 
relevance determination (ARD) concept in Bayesian learning was discussed, and in Section 13.9, the 
variational bound technique was exploited to overcome the computational obstacle associated with the 
Laplacian prior. 

In this vein, there are a number of questions that are naturally posed. The first one concerns the 
relationship between the Bayesian and the regularized cost function optimization approaches. How 
different are they? Are there any paths that establish connections among them? The second question 
addresses theoretical issues associated with the performance of the Bayesian techniques compared to 
their counterparts that were discussed in Chapter 9. A first systematic attempt to address both questions 
was made in [73,74,91,93,94]. Furthermore, another important theoretical question addresses the task 
of identifiability concerning the SBL models [64]. 

A summary of results concerning the previous tasks can be downloaded from the book’s website 
under the Additional Material part that is associated with the current chapter. 


13.10 EXPECTATION PROPAGATION 


Expectation propagation is an alternative to the variational techniques for approximating posterior 
PDFs. The task of interest is the same as the one treated in the beginning of this chapter, in Section 13.2. 
Assume that we are given a set of observations $\mathcal{X}$, which are distributed according to $p(\mathcal{X}|\boldsymbol{\theta})$, and a 
prior, $p(\boldsymbol{\theta})$, corresponding to the set of the unknown parameters $\boldsymbol{\theta}$. The goal is to obtain an estimate 
of the posterior, $p(\boldsymbol{\theta}|\mathcal{X})$, assuming that its computation is intractable. 

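Let us denote by $q(\boldsymbol{\theta})$ the estimate of the posterior. The starting point is to compute $q$ by minimizing 
the KL divergence, 

$$\mathrm{KL}(p\|q) = \int p(\boldsymbol{\theta}|\mathcal{X})\ln\frac{p(\boldsymbol{\theta}|\mathcal{X})}{q(\boldsymbol{\theta})}\,d\boldsymbol{\theta}. \tag{13.103}$$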


Note that $\mathrm{KL}(p\|q)$ is different from the $\mathrm{KL}(q\|p)$ divergence, which is involved in the bound in 
Eq. (13.2). Because the KL divergence is not symmetric, the two methods minimize a different cost. 
Before proceeding any further, it is important to highlight some implications associated with the two 
forms of the KL divergence. 

• I-Projection: The $\mathrm{KL}(q\|p)$ divergence is given by 

$$\mathrm{KL}(q\|p) = \int q(\boldsymbol{\theta})\ln\frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta}|\mathcal{X})}\,d\boldsymbol{\theta}. \tag{13.104}$$

This is sometimes known as I-projection or information projection. Looking carefully at it, note 
that in regions of the parameter space where $p(\boldsymbol{\theta}|\mathcal{X})$ assumes small values, $\mathrm{KL}(q\|p)$ gets large 
values and minimization pushes $q(\boldsymbol{\theta})$ to small values as well. Consider now the case that $p(\boldsymbol{\theta}|\mathcal{X})$ is 
bimodal, while $q(\boldsymbol{\theta})$ is constrained to be unimodal. Then minimizing $\mathrm{KL}(q\|p)$ will force $q$ to be 
placed close to either of the two peaks of $p$ in order to get small values in the regions where $p$ takes 
small values too. 

If other hidden variables are also involved, we consider them as part of $\boldsymbol{\theta}$. 

• M-Projection: We now turn our focus to the $\mathrm{KL}(p\|q)$ divergence, defined in Eq. (13.103). This is also 
known as M-projection or moment projection. For the case discussed before, in the regions where $p$ 
assumes large values, $\mathrm{KL}(p\|q)$ gets large values and minimization estimates $q$ in order to have large 
values in these regions too. Thus, the estimate $q$ is placed in order for its mode to lie somewhere 
between the two modes of $p$, as a compromise between the two. Obviously, this is not a good result, 
because the estimate puts high-probability mass in regions where $p$ assumes small values. This 
discussion points out some limitations on the performance that the expectation propagation method 
is expected to exhibit in practice, because it is based on $\mathrm{KL}(p\|q)$ minimization. 
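
The different behavior of the two projections can be observed numerically. The following sketch assumes a bimodal target (an equal-weight mixture of $\mathcal{N}(-3, 1)$ and $\mathcal{N}(3, 1)$, chosen only for illustration) and compares two candidate unimodal Gaussians: one locked onto a single peak and one whose mean and variance match those of the target. The grid limits and function names are likewise assumptions.

```python
import math

def log_gauss(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_p(x):
    """Assumed bimodal target: 0.5 N(-3, 1) + 0.5 N(3, 1), evaluated via log-sum-exp."""
    a, b = log_gauss(x, -3.0, 1.0), log_gauss(x, 3.0, 1.0)
    m = max(a, b)
    return math.log(0.5) + m + math.log(math.exp(a - m) + math.exp(b - m))

def kl(log_f, log_g, lo=-25.0, hi=25.0, step=0.01):
    """Riemann-sum approximation of KL(f || g), the integral of f ln(f / g)."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        lf = log_f(x)
        total += math.exp(lf) * (lf - log_g(x))
    return total * step

def q_mode(x):
    """Unimodal candidate placed on the right peak of p."""
    return log_gauss(x, 3.0, 1.0)

def q_mm(x):
    """Unimodal candidate matching the mean (0) and variance (1 + 9 = 10) of p."""
    return log_gauss(x, 0.0, 10.0)

kl_q_p = {"mode": kl(q_mode, log_p), "moment-matched": kl(q_mm, log_p)}   # I-projection cost
kl_p_q = {"mode": kl(log_p, q_mode), "moment-matched": kl(log_p, q_mm)}   # M-projection cost
print("KL(q||p):", kl_q_p)
print("KL(p||q):", kl_p_q)
```

Among these two candidates, $\mathrm{KL}(q\|p)$ favors the Gaussian sitting on one peak, whereas $\mathrm{KL}(p\|q)$ favors the broad Gaussian centered between the peaks, in line with the discussion above.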

We now assume that $p(\mathcal{X}, \boldsymbol{\theta})$ can be factorized, that is, 

$$p(\mathcal{X}, \boldsymbol{\theta}) = \prod_j f_j(\boldsymbol{\theta}). \tag{13.105}$$


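For example, such a product can cover the case where 

$$p(\mathcal{X}, \boldsymbol{\theta}) = \prod_n p(\boldsymbol{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}),$$

where $p(\boldsymbol{\theta})$ is the corresponding prior. The more general formulation of the factorization used in 
Eq. (13.105) can serve the needs for more general tasks, as for example graphical models to be treated 
in Chapter 15. Thus, we can now write 

$$p(\boldsymbol{\theta}|\mathcal{X}) = \frac{1}{p(\mathcal{X})}\prod_j f_j(\boldsymbol{\theta}), \tag{13.106}$$

where $p(\mathcal{X})$ is the evidence of the model. The estimate $q$ will be chosen to be given in a factorized 
form, as in the variational approach in Section 13.2, that is, 

$$q(\boldsymbol{\theta}) = \frac{1}{Z}\prod_j \tilde{f}_j(\boldsymbol{\theta}), \tag{13.107}$$

where $\tilde{f}_j(\boldsymbol{\theta})$ corresponds to $f_j(\boldsymbol{\theta})$ and $Z$ is the normalizing constant. The next assumption is that $q(\boldsymbol{\theta})$ 
is constrained to lie within the exponential family of PDFs (Section 12.8) and for our current needs it 
can be written as 

$$q(\boldsymbol{\theta}) := g(\boldsymbol{a})\,h(\boldsymbol{\theta})\exp\left(\boldsymbol{a}^T\boldsymbol{u}(\boldsymbol{\theta})\right), \tag{13.108}$$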


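where $\boldsymbol{a}$ is the associated set of parameters. 

MINIMIZING THE KL DIVERGENCE 

Plugging into Eq. (13.103) the definition in Eq. (13.108) and collecting all terms that are independent 
of $\boldsymbol{a}$ in a constant, we readily obtain 

$$\mathrm{KL}(p\|q) = -\ln g(\boldsymbol{a}) - \int p(\boldsymbol{\theta}|\mathcal{X})\left[\boldsymbol{a}^T\boldsymbol{u}(\boldsymbol{\theta})\right]d\boldsymbol{\theta} + \text{constants}. \tag{13.109}$$

Taking the gradient with respect to $\boldsymbol{a}$ and equating to zero we get 

$$-\frac{1}{g(\boldsymbol{a})}\nabla g(\boldsymbol{a}) = \mathbb{E}_p[\boldsymbol{u}(\boldsymbol{\theta})]. \tag{13.110}$$

However, from Eq. (13.108) we have 

$$g(\boldsymbol{a})\int h(\boldsymbol{\theta})\exp\left(\boldsymbol{a}^T\boldsymbol{u}(\boldsymbol{\theta})\right)d\boldsymbol{\theta} = 1,$$

and taking the gradient with respect to $\boldsymbol{a}$ results in 

$$\boldsymbol{0} = \nabla g(\boldsymbol{a})\int h(\boldsymbol{\theta})\exp\left(\boldsymbol{a}^T\boldsymbol{u}(\boldsymbol{\theta})\right)d\boldsymbol{\theta} + g(\boldsymbol{a})\int h(\boldsymbol{\theta})\exp\left(\boldsymbol{a}^T\boldsymbol{u}(\boldsymbol{\theta})\right)\boldsymbol{u}(\boldsymbol{\theta})\,d\boldsymbol{\theta}$$

or 

$$-\frac{1}{g(\boldsymbol{a})}\nabla g(\boldsymbol{a}) = \mathbb{E}_q[\boldsymbol{u}(\boldsymbol{\theta})],$$

which combined with Eq. (13.110) finally results in 

$$\mathbb{E}_q[\boldsymbol{u}(\boldsymbol{\theta})] = \mathbb{E}_p[\boldsymbol{u}(\boldsymbol{\theta})]: \ \text{moment matching}. \tag{13.111}$$

The latter is an elegant equation known as moment matching. It basically states that at the optimum, 
$q(\boldsymbol{\theta})$, the expectations of its sufficient statistics are equal to the expectations associated with the PDF 
to be learned. For example, if $q$ is chosen to be a Gaussian, the sufficient statistics involve the mean 
and the covariance matrix. Thus, all one has to do is compute the mean and covariance with respect to 
the PDF to be learned, $p$ (assuming that they can be obtained), and use them to define the respective Gaussian. 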
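
As a small worked instance of Eq. (13.111), consider fitting a univariate Gaussian $q$ to a Gaussian mixture $p$. For a Gaussian, the sufficient statistics are $\theta$ and $\theta^2$, so moment matching amounts to copying the mean and the variance of the mixture. The sketch below assumes an arbitrary two-component mixture; the function name and the numbers are illustrative only.

```python
def mixture_moments(weights, means, variances):
    """Mean and variance of a univariate Gaussian mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean

# Assumed target p: 0.3 N(-2, 0.5) + 0.7 N(1, 1.5).
q_mean, q_var = mixture_moments([0.3, 0.7], [-2.0, 1.0], [0.5, 1.5])

# The moment-matched Gaussian q = N(q_mean, q_var) satisfies E_q[u] = E_p[u]
# for u(theta) = [theta, theta^2].
print(q_mean, q_var)
```

The resulting variance is larger than that of either component, because it also accounts for the spread between the two component means.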


THE EXPECTATION PROPAGATION ALGORITHM 

We will now make use of the moment matching result to obtain the factors, $\tilde{f}_j(\boldsymbol{\theta})$, one at a time. The 
algorithm starts from some initial estimates, $\tilde{f}_j^{(0)}$. Let us assume that we are currently seeking to update 
factor $\tilde{f}_k(\boldsymbol{\theta})$. Let $q^{(i)}(\boldsymbol{\theta})$ be the currently available estimate of $q(\boldsymbol{\theta})$, at the $i$th iteration. 

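Step 1: Remove $\tilde{f}_k^{(i)}(\boldsymbol{\theta})$ from $q^{(i)}(\boldsymbol{\theta})$, and define 

$$q_{/k}^{(i)}(\boldsymbol{\theta}) := \frac{q^{(i)}(\boldsymbol{\theta})}{\tilde{f}_k^{(i)}(\boldsymbol{\theta})}. \tag{13.112}$$

Step 2: Define the PDF 

$$\frac{1}{Z_k}f_k(\boldsymbol{\theta})\,q_{/k}^{(i)}(\boldsymbol{\theta}). \tag{13.113}$$

In other words, in the current estimate $q^{(i)}(\boldsymbol{\theta})$, $\tilde{f}_k^{(i)}(\boldsymbol{\theta})$ is replaced by $f_k(\boldsymbol{\theta})$, and $Z_k$ is the 
corresponding normalizing constant. 

Step 3: Compute the normalizing constant, 

$$Z_k = \int f_k(\boldsymbol{\theta})\,q_{/k}^{(i)}(\boldsymbol{\theta})\,d\boldsymbol{\theta}. \tag{13.114}$$

Step 4: In this step, the optimization is performed by minimizing the KL divergence, 

$$\mathrm{KL}\left(\frac{1}{Z_k}f_k(\boldsymbol{\theta})\,q_{/k}^{(i)}(\boldsymbol{\theta})\,\Big\|\,q^{(i+1)}(\boldsymbol{\theta})\right).$$

This is achieved by moment matching, and the new $q^{(i+1)}$ is defined so that the expectations 
of the respective sufficient statistics are matched to those of $\frac{1}{Z_k}f_k(\boldsymbol{\theta})\,q_{/k}^{(i)}(\boldsymbol{\theta})$; this operation 
is assumed to be computationally tractable. 

Step 5: Compute $\tilde{f}_k^{(i+1)}(\boldsymbol{\theta})$ such that 

$$\tilde{f}_k^{(i+1)}(\boldsymbol{\theta}) := K\,\frac{q^{(i+1)}(\boldsymbol{\theta})}{q_{/k}^{(i)}(\boldsymbol{\theta})}, \tag{13.115}$$

where the proportionality constant is computed so that 

$$\int \tilde{f}_k^{(i+1)}(\boldsymbol{\theta})\,q_{/k}^{(i)}(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int f_k(\boldsymbol{\theta})\,q_{/k}^{(i)}(\boldsymbol{\theta})\,d\boldsymbol{\theta}, \tag{13.116}$$

which results in $K = Z_k$. 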

The procedure is then applied for the estimation of $\tilde{f}_{k+1}$. For convergence, more than one pass 
has to be performed. The evidence can be approximated as 

$$p(\mathcal{X}) \approx \int \prod_j \tilde{f}_j(\boldsymbol{\theta})\,d\boldsymbol{\theta}. \tag{13.117}$$

A detailed application of the algorithm in the context of a simple-to-follow example is given in [48]. 
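
Steps 1 to 5 can be sketched for a scalar parameter with Gaussian approximating factors. The example below is a toy illustration under several assumptions that are not part of the text: a Gaussian prior factor $\mathcal{N}(\theta|0, 4)$ that is kept exact, four logistic-sigmoid likelihood factors, moments computed by brute-force summation on a grid, and approximating factors stored through their natural parameters (precision `r` and precision-times-mean `b`). The cavity is normalized and the scale constants of the factors (the constant $K$ of Step 5) are not tracked, since they are needed only for the evidence approximation.

```python
import math

STEP = 0.01
GRID = [-15.0 + STEP * i for i in range(3001)]    # integration grid for theta

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def tilted_moments(factor, r_cav, b_cav):
    """Steps 2-3: normalizer, mean, and variance of f_k(theta) * cavity(theta) on the grid."""
    m_cav, v_cav = b_cav / r_cav, 1.0 / r_cav
    w = [factor(t) * math.exp(-0.5 * (t - m_cav) ** 2 / v_cav) / math.sqrt(2.0 * math.pi * v_cav)
         for t in GRID]
    z = sum(w) * STEP
    mean = sum(wi * t for wi, t in zip(w, GRID)) * STEP / z
    var = sum(wi * (t - mean) ** 2 for wi, t in zip(w, GRID)) * STEP / z
    return z, mean, var

def ep(factors, prior_var, n_passes=10):
    """EP with Gaussian approximating factors exp(b_j * theta - r_j * theta^2 / 2)."""
    r = [0.0] * len(factors)                       # initial factors are flat
    b = [0.0] * len(factors)
    r_q, b_q = 1.0 / prior_var, 0.0                # q starts as the Gaussian prior factor
    for _ in range(n_passes):
        for k, f_k in enumerate(factors):
            r_cav, b_cav = r_q - r[k], b_q - b[k]                 # Step 1: remove factor k
            _, mean, var = tilted_moments(f_k, r_cav, b_cav)      # Steps 2-3
            r_q, b_q = 1.0 / var, mean / var                      # Step 4: moment matching
            r[k], b[k] = r_q - r_cav, b_q - b_cav                 # Step 5: new factor = q / cavity
    return b_q / r_q, 1.0 / r_q                    # mean and variance of q

labels = (1, 1, 1, -1)                             # assumed toy observations
factors = [lambda t, s=s: sigmoid(s * t) for s in labels]
ep_mean, ep_var = ep(factors, prior_var=4.0)

# Brute-force posterior moments on the same grid, for comparison.
post = [math.exp(-t * t / 8.0) * sigmoid(t) ** 3 * sigmoid(-t) for t in GRID]
norm = sum(post)
exact_mean = sum(w * t for w, t in zip(post, GRID)) / norm
exact_var = sum(w * (t - exact_mean) ** 2 for w, t in zip(post, GRID)) / norm
print("EP:   ", ep_mean, ep_var)
print("exact:", exact_mean, exact_var)
```

In a realistic setting the grid-based moments would be replaced by closed-form or quadrature-based moment computations, which is where the tractability assumption of Step 4 enters.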
Remarks 13.5. 


• In general, there is no guarantee that the algorithm will converge, which is a major disadvantage of 
the method. However, it can be shown that if the iterations do converge, the solution is a stationary 
point of a particular energy function [48]. Recall that in the variational Bayes approach, there is a 
guarantee of convergence to a local optimum point. Of course, one could optimize the KL 
divergence for the expectation propagation method directly, which guarantees convergence, but in this 
case the algorithm is more complex and slow. 

• Taking into account our discussion concerning the two forms of KL divergence, the expectation 
propagation method results in poor performance when the true posterior is multimodal. However, 
for other scenarios, such as logistic-type models, the expectation propagation method can offer 
competitive and sometimes better performance compared to the variational methods or methods 
built around the Laplacian approximation (see, e.g., [39,48]). 

• The expectation propagation algorithm was first proposed in [48], and it is a modification of what 
was known before as assumed density filtering (ADF) or moment matching (e.g., [57] and the 
references therein). 

• Looking at the factors $f_j(\boldsymbol{\theta})$ in a more general view, it turns out that the expectation propagation 
offers the vehicle for obtaining a range of message passing algorithms in the context of probabilistic 
graphical models (Chapter 15) [49]. 

• α-Divergence: Having spent time discussing in some detail the two forms of KL divergence, it is 
interesting to point out that both formulations can be obtained as special cases of a more general 
family known as the α family of divergences, defined as 

$$D_\alpha(p\|q) := \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\,q(x)^{(1-\alpha)/2}\,dx\right): \ \alpha\text{-divergence}, \tag{13.118}$$

where $\alpha \in \mathbb{R}$ is a parameter. Note that $\mathrm{KL}(p\|q)$ is obtained at the limit $\alpha \to 1$ and $\mathrm{KL}(q\|p)$ is 
obtained for $\alpha \to -1$; $D_\alpha(p\|q)$ is nonnegative and it becomes zero if $p = q$ (see, e.g., [2]). 
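
The two limiting cases of Eq. (13.118) can be checked numerically. The sketch below assumes two univariate Gaussians, $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(1, 2.25)$, a fixed integration grid, and values of $\alpha$ close to $\pm 1$; these choices, and the function names, are illustrative assumptions.

```python
import math

DX = 0.01
XS = [-20.0 + DX * i for i in range(4001)]       # integration grid

def gauss(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def p(x):
    return gauss(x, 0.0, 1.0)

def q(x):
    return gauss(x, 1.0, 2.25)

def alpha_div(f, g, alpha):
    """Grid approximation of the alpha-divergence of Eq. (13.118), for |alpha| != 1."""
    integral = sum(f(x) ** ((1.0 + alpha) / 2.0) * g(x) ** ((1.0 - alpha) / 2.0) for x in XS) * DX
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

def kl(f, g):
    """Grid approximation of KL(f || g)."""
    return sum(f(x) * math.log(f(x) / g(x)) for x in XS) * DX

d_near_plus_one = alpha_div(p, q, 0.999)         # should be close to KL(p || q)
d_near_minus_one = alpha_div(p, q, -0.999)       # should be close to KL(q || p)
print(d_near_plus_one, kl(p, q))
print(d_near_minus_one, kl(q, p))
```

Intermediate values of $\alpha$ interpolate between the two behaviors; $\alpha = 0$ gives a symmetric divergence.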


13.11 NONPARAMETRIC BAYESIAN MODELING 

The Bayesian approach to parametric modeling has been the focus of our attention in the current and 
previous chapters. The underlying assumption was that the number of the unknown parameters was 
fixed and finite. We now turn our attention to a more general task. We will assume that the hidden 
structure of our model is not fixed but is allowed to grow with the data. In other words, its complexity 
is not specified a priori but is left to be determined from the data. This is the reason that such models 
are called nonparametric; recall from Chapter 3 that a model is called parametric if the number of free 
parameters is fixed and independent of the size of the data set. 

We will avoid treating nonparametric Bayesian models in a mathematically rigorous sense. Such 
a path would take us a bit far from the purpose of this book and also from the mathematical skills of 
the average reader. Thus, we will be content with presenting the main concepts in a mathematically 
“humble” way. Once the basics have been grasped, the keen reader can delve deeper into the topic by 
referring to more specialized literature (see, e.g., [27]). To demonstrate the idea behind nonparametric 
Bayesian models, we will start with our familiar mixture modeling task. In the sequel, the so-called 
matrix factorization problem will be introduced and it will be treated in the nonparametric context. 

In Section 13.4, $K$ mixtures (clusters) were assumed. Each mixture was modeled via a Gaussian 
PDF with unknown mean and precision matrix, $(\boldsymbol{\mu}_k, Q_k)$, $k = 1, 2, \ldots, K$. These, in turn, were 
considered as random entities and were dressed up with a prior: a Gaussian PDF for the mean and a Wishart 
one for the precision matrix. The probabilities $P_k$, $k = 1, 2, \ldots, K$, for each mixture were treated either 
as constants, which were optimized during the M-step, or they were considered random variables and 
a Dirichlet PDF prior was associated with them (Problem 13.7). The goal was to obtain an estimate 
of the posterior probabilities of the labels associated with each observation point, $P(k_n|\mathcal{X})$. There, 
we resorted to variational techniques via the mean field approximation and the posterior was 
approximated by the function $q_Z(Z) = q_Z(\boldsymbol{z}_1, \ldots, \boldsymbol{z}_N)$, where $\boldsymbol{z}_n$, $n = 1, 2, \ldots, N$, were 0-1 coding vectors, 
with a one placed at the location corresponding to a specific mixture ($k$) and the rest of the elements 
being zero. We refer to such vectors as one-hot vectors. In this way, an equivalent clustering of the 
observation points was achieved in $K$ Gaussian-distributed clusters. 

The nonparametric counterpart of the previous task is expressed in almost the same way, albeit with 
a single important difference. The number of mixtures, $K$, is not fixed to a finite value; as a matter of 
fact, the number of mixtures is allowed to be countably infinite. There are two questions that now pop 
up: (a) how can one deal with an infinite number of clusters, and (b) how can one deal with a prior 
distribution related to infinitely many probability values? 

To give an indication of how to deal with infinity, recall that in nonparametric modeling the number 
of data points is still finite and equal to $N$. Thus, whatever model one adopts, there is no way of having 
more than $N$ mixtures (clusters); the latter corresponds to the worst-case scenario, where each one 
of the points belongs to a different cluster. Hence, although in theory one can have infinitely many 
clusters, only a finite subset of them is nonempty. Thus, all we need to do is to obtain an explicit 
representation of the nonempty mixture components. 

Concerning the second question, it can be shown that a prior distribution over an infinite number 
of groupings, P(Z), that favors assigning data to a small number of groups is the Chinese restaurant 
process (CRP); this is a distribution over infinite partitions of the integers [1,22,63]. 

We will first state the algorithm that generates a draw (realization) from such a process and then we 
will provide more details for those who are interested in further theoretical aspects that are associated 
with such processes. 


13.11.1 THE CHINESE RESTAURANT PROCESS 

The name draws from the seemingly infinite number of tables in some very large Chinese restaurants in 
California. Each table is associated with one cluster/mixture and each customer with one observation. 
The first customer sits at the first table. The second customer sits at the first table with probability $\frac{1}{1+\alpha}$ 
and at a new table with probability $\frac{\alpha}{1+\alpha}$. The $n$th customer sits at one of the previously occupied tables 
with a probability proportional to the number of people who already sit at it, and he/she sits at a new 
table with a probability proportional to $\alpha$. The parameter $\alpha$ is known as the concentration parameter. 
The larger its value, the more tables are occupied and the fewer the customers who sit at a single table. 
In a more formal way, let $k_n$ denote the table for the $n$th customer. Then we can write 

$$P(k_n = k\,|\,k_{1:n-1}) = \begin{cases} \dfrac{n_k}{n - 1 + \alpha}, & \text{if } k \le K_{n-1}, \\[2mm] \dfrac{\alpha}{n - 1 + \alpha}, & \text{otherwise}, \end{cases} \tag{13.119}$$


where $K_{n-1}$ is the number of tables occupied by the previously $n - 1$ arrived customers and $n_k$ the 
number of customers already sitting at table $k$. It can be shown that the expected number of occupied 
tables grows as $\alpha\ln n$; that is, the expected number of clusters grows with the number of data (e.g., 
[19]). The rule in Eq. (13.119) provides the sampling philosophy for assigning data to clusters as new 
data arrive sequentially. It can be shown (see, e.g., [19]) that the resulting probability $P(k_1, \ldots, k_n)$ is 
independent (up to label changes) of the sequence in which data arrive; this is an important invariance 
property. 

FIGURE 13.14 

Every customer sits at one of the previously occupied tables or selects to sit at a new one. Observe that the 
probability of a customer to select to sit at a new table decreases fast. Dots indicate the tables where customers sit. The 
first table is occupied by many customers. As we move downwards to the second, third, etc., table, the number of 
customers that select them is fast decreasing. 

Fig. 13.14 illustrates a draw from a CRP process. Each customer sits at one of the previously 
occupied tables with some probability or prefers to sit at a new one. Observe that the probability of a 
customer to select a new table decreases fast. In contrast, the probability of a customer sitting at a table 
already occupied increases with the popularity of the table (rich gets richer). 
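
The sampling rule of Eq. (13.119) translates directly into a short simulation. The following is a minimal sketch; the function names, the seed, and the parameter values are assumptions made for illustration. The helper `expected_tables` uses the fact that the $n$th customer opens a new table with probability $\alpha/(n - 1 + \alpha)$ regardless of the current seating, so the expected number of occupied tables is the sum of these probabilities.

```python
import random

def crp(n_customers, alpha, seed=0):
    """Draw one seating arrangement from a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []                  # counts[k] = number of customers at table k
    seating = []                 # seating[i] = table index chosen by the i-th customer
    for n_seated in range(n_customers):
        # Eq. (13.119): table k has weight counts[k], a new table has weight alpha;
        # the weights sum to n_seated + alpha (the text's n - 1 + alpha).
        u = rng.random() * (n_seated + alpha)
        k, acc = 0, 0.0
        while k < len(counts) and u >= acc + counts[k]:
            acc += counts[k]
            k += 1
        if k == len(counts):     # no occupied table was selected: open a new one
            counts.append(0)
        counts[k] += 1
        seating.append(k)
    return seating, counts

def expected_tables(n_customers, alpha):
    """Expected number of occupied tables: sum over customers of alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(n_customers))

seating, counts = crp(200, alpha=1.5)
print("occupied tables:", len(counts), " expected:", round(expected_tables(200, 1.5), 2))
print("table sizes:", counts)
```

Running the simulation for increasing numbers of customers shows the slow, logarithmic-like growth of the number of occupied tables and the rich-gets-richer effect in the table sizes.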

13.11.2 DIRICHLET PROCESSES 

This section provides a brief discussion of the more general mathematical framework in which CRPs 
belong. This section can be bypassed in a first reading. 

The notion of a stochastic process was introduced in Section 2.4. A Dirichlet process (DP), $G$, 
which was first introduced in [16], is a distribution over distributions and it is defined in terms of (a) 
the concentration parameter, $\alpha$, and (b) the so-called base distribution, $G_0$, over a space $\Theta$, and we 
write $G \sim \mathrm{DP}(\alpha, G_0)$. We say that $G \sim \mathrm{DP}(\alpha, G_0)$ is a DP if for any partition $\mathcal{T}_k$, $k = 1, 2, \ldots, K$, of 
$\Theta$, i.e., $\Theta = \cup_{k=1}^{K}\mathcal{T}_k$, the following holds true: 

$$\left(G(\mathcal{T}_1), \ldots, G(\mathcal{T}_K)\right) \sim \mathrm{Dir}\left(\alpha G_0(\mathcal{T}_1), \ldots, \alpha G_0(\mathcal{T}_K)\right), \tag{13.120}$$

where $G_0(\mathcal{T}_k)$ is the probability (according to $G_0$) corresponding to the occurrence of $\mathcal{T}_k$, and $G(\mathcal{T}_k)$ is 
similarly defined. In other words, $G(\mathcal{T}_k)$, $k = 1, 2, \ldots, K$, are jointly distributed according to a 
Dirichlet distribution. To establish the connection with Chapter 2, where the Dirichlet probability distribution 
was defined, $\alpha G_0(\mathcal{T}_k)$ correspond to the associated parameters of the distribution in Eq. (2.95), i.e., 
$a_k$, $k = 1, 2, \ldots, K$, and $G(\mathcal{T}_k)$ to the values of the respective random variables, $x_k$. Recall that by the 
definition of a Dirichlet distribution, $x_k$, $k = 1, 2, \ldots, K$, lie in the interval $[0, 1]$ and they add to one, 
and hence they can be interpreted as probabilities. 

9 Strictly speaking, we should say measurable partition; a partition is measurable if it is closed under complementation and 
countable union. 

Also, from the definition of a random process in Section 2.4, we know that every realization that 
results from an experiment comprises an infinite number of samples (countable or uncountable), 
depending on whether it is a discrete or a continuous one. It has been shown in [16] that a DP is of a 
discrete nature;¹⁰ moreover, every realization can be interpreted as a probability distribution. In other 
words, the realizations of the process comprise distinct samples and each one corresponds to a 
probability value. Obviously, their sum is equal to one. Moreover, they comply with the defining property in 
Eq. (13.120). Mathematically, the above can be formulated as 

$$G(\boldsymbol{\theta}) = \sum_{i=1}^{\infty} P_i \delta_{\boldsymbol{\theta}_i}(\boldsymbol{\theta}), \tag{13.121}$$


where $\delta_{\boldsymbol{\theta}_i}(\boldsymbol{\theta})$ is equal to 1 if $\boldsymbol{\theta} = \boldsymbol{\theta}_i$ and zero otherwise. In other words, a draw (realization) 
$G \sim \mathrm{DP}(\alpha, G_0)$ puts a probability mass, $P_i$, on a specific countably infinite set of samples $\boldsymbol{\theta}_i$, $i = 1, 2, \ldots$. 
The points $\boldsymbol{\theta}_i$ are known as atoms and they are i.i.d. drawn from the base distribution, $G_0$, i.e., $\boldsymbol{\theta}_i \sim G_0$. 
Let us now elaborate on the result in Eq. (13.121) and its connection with Eq. (13.120) a bit more. 

Take any subset¹¹ $T_k \subseteq \Theta$ and collect all indices in Eq. (13.121), $I_k := \{i : \boldsymbol{\theta}_i \in T_k\}$. Then 
$G(T_k) = \sum_{i \in I_k} P_i$. Moreover, it is guaranteed that $G(\Theta) = 1$, since $G$ is a probability distribution and 
$\Theta = \bigcup_{k=1}^{K} T_k$. A DP guarantees that the probability values $G(T_k)$, $k = 1, 2, \ldots, K$, follow a Dirichlet 
distribution as in (13.120), for any finite number $K$. Note that $G_0$ may not be discrete. For example, it 
can be Gaussian or another probability density function. Note that $G$ is random in two ways; both the 
probability values $P_i$ and the locations $\boldsymbol{\theta}_i$ are randomly obtained. 

Mean and variance: For any subset $T_k \subseteq \Theta$, the mean value is 

$$\mathbb{E}[G(T_k)] = G_0(T_k),$$

and the variance is given by 

$$\mathrm{var}[G(T_k)] = \frac{G_0(T_k)\big(1 - G_0(T_k)\big)}{\alpha + 1}.$$


The proofs are given in Problem 13.20. The above indicates that the draws from a DP distribution stay 
“around” the base one and the variance is inversely proportional to the concentration parameter. 
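
The two formulas can be checked numerically on a finite partition, where by Eq. (13.120) the vector of probabilities is Dirichlet distributed. The following Python sketch is an illustration only: the three base probabilities, the number of draws, and the function names are assumptions made for the example, and the Dirichlet draws are obtained by normalizing independent gamma variables.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one sample from Dir(params) by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]


def empirical_mean_var(alpha, base_probs, k, num_draws=20000, seed=0):
    """Estimate the mean and variance of G(T_k) when
    (G(T_1), ..., G(T_K)) ~ Dir(alpha * G0(T_1), ..., alpha * G0(T_K))."""
    rng = random.Random(seed)
    params = [alpha * p for p in base_probs]
    values = [sample_dirichlet(params, rng)[k] for _ in range(num_draws)]
    mean = sum(values) / num_draws
    var = sum((v - mean) ** 2 for v in values) / num_draws
    return mean, var


# Assumed values of G0(T_1), G0(T_2), G0(T_3) for a three-set partition.
base_probs = [0.2, 0.3, 0.5]
for alpha in (1.0, 100.0):
    mean, var = empirical_mean_var(alpha, base_probs, k=1)
    theory = base_probs[1] * (1.0 - base_probs[1]) / (alpha + 1.0)
    print(f"alpha={alpha}: mean={mean:.4f}, var={var:.5f}, theoretical var={theory:.5f}")
```

The empirical mean should stay close to the base probability for both values of the concentration parameter, while the spread shrinks as the concentration parameter grows.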

Posterior from prior distributions: We know that every draw $G \sim \mathrm{DP}(\alpha, G_0)$ is a distribution and 
it is used to draw samples from $\Theta$. Let $\boldsymbol{\theta}_i \in \Theta \sim G$ be a sequence of such i.i.d. drawn samples. Our 


10 Strictly speaking, with probability one. 

11 Strictly speaking, measurable. 
goal now is to derive the posterior probabilities $G(T_k)$, $k = 1, 2, \ldots, K$, given a set of $n$ observations, 
$\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n$. 

To this end, recall from Example 12.6 that the Dirichlet distribution is the conjugate of the 
multinomial one, given in Eq. (2.58), i.e., 

$$P(n_1, n_2, \ldots, n_K) = \binom{n}{n_1, \ldots, n_K} \prod_{k=1}^{K} G(T_k)^{n_k}, \tag{13.122}$$

where we used $n_k$ in place of $x_k$ to be in line with a more standard notation used in DPs. Recall that $n_k$ is 
the number of times the $k$th variable, associated with probability $P_k = G(T_k)$, occurs after $n$ successive 
experiments. In our setting, $n_k$ is the cardinality of the previously defined set $I_k$, i.e., 

$$n_k = \#\{i : \boldsymbol{\theta}_i \in T_k\}.$$

In words, $n_k$ is the number of draws from $\Theta$ that lie in $T_k$. 

Having observed $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n$, we see they are equivalent to specific occurrence numbers, $n_1, \ldots, n_K$. 
Thus, taking into account the property of the conjugate pairs, as expressed in Eq. (12.77) in the context 
of Example 12.6, it can be readily seen that 

$$(G(T_1), \ldots, G(T_K) \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n) \sim \mathrm{Dir}(\alpha G_0(T_1) + n_1, \ldots, \alpha G_0(T_K) + n_K). \tag{13.123}$$

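It is not difficult to see (e.g., Problem 13.21) that the above can be equivalently written as 

$$(G(T_1), \ldots, G(T_K) \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n) \sim \mathrm{Dir}(\alpha' G_0'(T_1), \ldots, \alpha' G_0'(T_K)),$$

where the new concentration parameter and base distribution are given by 

$$\alpha' = \alpha + n \quad \text{and} \quad G_0' = \frac{1}{\alpha + n}\left(\alpha G_0 + \sum_{i=1}^{n} \delta_{\boldsymbol{\theta}_i}(\boldsymbol{\theta})\right).$$

Hence, we can compactly write that 

$$G \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n \sim \mathrm{DP}\left(\alpha + n, \; \frac{1}{\alpha + n}\left(\alpha G_0 + \sum_{i=1}^{n} \delta_{\boldsymbol{\theta}_i}(\boldsymbol{\theta})\right)\right). \tag{13.124}$$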


The above is an elegant result. The concentration parameter increases by the number of observations 
that we have at our disposal. Recall that the variance around the mean of the associated probability 
values is inversely proportional to the concentration parameter. Hence, the obtained result is in line 
with common sense. The more observations we get, the lower our related uncertainty becomes. Also, 
the base distribution associated with the posterior DP comprises two components, i.e., the original one 
and a distribution which is of a discrete nature. As a matter of fact, it imposes probability masses on 
the observed values. 
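
As a small numerical illustration of the posterior update restricted to a fixed partition, the following Python sketch computes the new concentration parameter and the new base probabilities from the occurrence counts, and compares the variance before and after the update. The function names and the numerical values are hypothetical choices made for the example.

```python
def dp_posterior_on_partition(alpha, base_probs, counts):
    """Posterior concentration and base probabilities on a partition.

    base_probs[k] plays the role of G0(T_k) and counts[k] the role of n_k.
    """
    n = sum(counts)
    alpha_post = alpha + n
    base_post = [(alpha * g + n_k) / alpha_post
                 for g, n_k in zip(base_probs, counts)]
    return alpha_post, base_post


def dp_variance(alpha, g):
    """Variance of G(T_k): G0(T_k) (1 - G0(T_k)) / (alpha + 1)."""
    return g * (1.0 - g) / (alpha + 1.0)


alpha, base_probs = 2.0, [0.5, 0.5]   # assumed prior quantities
counts = [3, 1]                       # assumed counts: n_1 = 3, n_2 = 1
alpha_post, base_post = dp_posterior_on_partition(alpha, base_probs, counts)
print("posterior concentration:", alpha_post)
print("posterior base probabilities:", base_post)
print("prior variance of G(T_1):", dp_variance(alpha, base_probs[0]))
print("posterior variance of G(T_1):", dp_variance(alpha_post, base_post[0]))
```

The posterior Dirichlet parameters are recovered as the product of the new concentration parameter with the new base probabilities, which equals the prior parameters plus the counts.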
Predictive Distribution and the Polya Urn Model 

Let us now build upon the posterior DP formulation in Eq. (13.124). To this end, the reasoning provided 
in [80] will be followed. Having observed the samples $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n$, we focus on the next draw, $\boldsymbol{\theta}_{n+1}$, 
and we will compute the probability of the sample to lie in an interval $T$. To simplify notation, let us 
use $G'$ to denote the posterior $G \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n$. Then we write 

$$P(\boldsymbol{\theta}_{n+1} \in T \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n) = G'(T).$$


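However, we know that $G'$ is itself a random draw from the posterior DP. Marginalizing out with 
respect to $G'$, i.e., taking the expectation, and recalling the property of the mean value stated before, 
we get 

$$P(\boldsymbol{\theta}_{n+1} \in T \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n) = \frac{1}{\alpha + n}\left(\alpha G_0(T) + \sum_{i=1}^{n} \delta_{\boldsymbol{\theta}_i}(\boldsymbol{\theta} \in T)\right).$$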


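The above is true for any $T \subseteq \Theta$, and it can be reformulated as 

$$\boldsymbol{\theta}_{n+1} \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n \sim \frac{\alpha}{\alpha + n} G_0 + \frac{n}{\alpha + n}\left(\frac{1}{n} \sum_{i=1}^{n} \delta_{\boldsymbol{\theta}_i}(\boldsymbol{\theta})\right). \tag{13.125}$$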


In the above formulation, the term in parentheses is written in a way to remind us of its discrete 
probability nature that adds to one. 

The last formula leads to the following physical interpretation. Consider an empty urn and assume 
that each value in $\Theta$ corresponds to a unique color. Also, we are given a countably infinite number 
of balls. Pick at random a color from the base distribution, i.e., $\boldsymbol{\theta}_1 \sim G_0$, and paint the ball with the 
corresponding color. After the ball has been painted, it is placed in the urn, which now contains one 
ball. In the second step, we pick up a new ball and we either (a) pick a new color, $\boldsymbol{\theta}_2 \sim G_0$, with 
probability $\frac{\alpha}{\alpha + 1}$, paint it accordingly, and place the painted ball in the urn, or (b) with probability $\frac{1}{\alpha + 1}$, 
paint the ball with the same color as that of the ball which is already in the urn and then place it in the 
urn. Now the urn contains two balls. The steps are repeated. So, at the $(n + 1)$th step, the urn contains 
$n$ balls. We pick a new ball and we either (a) pick a new color $\boldsymbol{\theta}_{n+1} \sim G_0$ with probability $\frac{\alpha}{\alpha + n}$, paint 
the new ball, and place it in the urn, or (b) with probability $\frac{n}{\alpha + n}$ we randomly pick one of the $n$ balls 
in the urn, and choose its color to paint the new one; after painting, it is also placed in the urn, which 
will now contain $n + 1$ balls. This interpretation has been used in [10] to show the existence of DPs. 
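
The urn scheme translates almost directly into code. The following Python sketch is only an illustration under stated assumptions: the base distribution is taken to be a standard Gaussian, colors are represented by real numbers, and the function name `polya_urn` and its arguments are hypothetical.

```python
import random


def polya_urn(n, alpha, base_sampler, seed=0):
    """Generate n draws following the urn scheme described above.

    base_sampler(rng) returns a fresh "color" drawn from the base distribution.
    """
    rng = random.Random(seed)
    urn = []
    for i in range(n):  # i balls are already in the urn
        if rng.random() < alpha / (alpha + i):
            # New color from the base distribution (always taken when i == 0).
            urn.append(base_sampler(rng))
        else:
            # Reuse the color of a ball picked uniformly from the urn.
            urn.append(rng.choice(urn))
    return urn


# Assumed base distribution: standard Gaussian.
draws = polya_urn(1000, alpha=2.0, base_sampler=lambda rng: rng.gauss(0.0, 1.0))
print("number of draws:", len(draws))
print("number of distinct colors:", len(set(draws)))
```

Because a ball is picked uniformly from the urn, a color that already appears many times is more likely to be reused, which is the "rich gets richer" effect.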

There are two important properties. Note that as $n$ increases, the probability of selecting a color 
from the urn keeps increasing, relative to that of selecting a color from $G_0$. Thus, for long sequences, 
we are going to select more and more previously used colors, which will be repeated. This justifies the 
discrete nature of the process. 

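The other important property is that it turns out that the probability of generating a sequence of 
colors, i.e., $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n$, is equal to the probability of generating these colors in any sequence. That is, 

$$P(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n) = P(\boldsymbol{\theta}_{\pi(1)}, \ldots, \boldsymbol{\theta}_{\pi(n)}),$$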
FIGURE 13.15 

Three different realizations of a DP for two different values of the concentration parameter, $\alpha$. For $\alpha = 1$, only a few 
(discrete) probability mass values survive, in all three corresponding draws (top row). On the contrary, for $\alpha = 100$, 
a much larger number of probability masses are obtained. Note that, in all cases, the generated masses stay around 
the base distribution, i.e., the standard normal one for this case. 


where $\pi(\cdot)$ denotes any permutation of the numbers in $\{1, 2, \ldots, n\}$. Such sequences are known as 
exchangeable (see, e.g., [36] for an insightful discussion). 

Fig. 13.15 shows different realizations for two different values of the concentration parameter $\alpha$. 
The base distribution is the standard Gaussian with zero mean and unit variance. Two different values 
of the concentration parameter are used, and for each one three different draws are performed. For 
the smaller value, $\alpha = 1$, only a small number of discrete probability masses survive. This is natural, 
because in Eq. (13.125), when the value of $\alpha$ is small, the significance of the first of the two terms on 
the right-hand side is small. Thus, the probability of selecting previously generated values is large. The 
opposite is true when $\alpha$ gets large values. In the latter case, the significance of the first term becomes 
small only when $n$ becomes relatively large. Thus, a larger number of probability masses survive. 


Chinese Restaurant Process Revisited 

We have already commented on the discrete nature of the DPs. Furthermore, the rule in Eq. (13.125) 
implies a clustering structure of the resulting values from the draws. As said, successive draws (colors) 
repeatedly pick up previously drawn values. Let $\boldsymbol{\theta}_1^*, \ldots, \boldsymbol{\theta}_{K_n}^*$ be the $K_n$ colors that have survived after 
$n$ successive draws. Let $n_1, \ldots, n_{K_n}$ be the corresponding numbers of occurrence for each one of the 
colors. Then Eq. (13.125) can be rewritten as 

$$\boldsymbol{\theta}_{n+1} \,|\, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_n \sim \frac{\alpha}{\alpha + n} G_0 + \frac{n}{\alpha + n}\left(\frac{1}{n} \sum_{j=1}^{K_n} n_j \delta_{\boldsymbol{\theta}_j^*}(\boldsymbol{\theta})\right).$$

Observe that the above formulation leads directly to Eq. (13.119) (if we apply it for the $n$th instead 
of the $(n + 1)$th customer). Indeed, we either select a new color (table in the case of CRP) or the $i$th 
previously selected one, with probability $\frac{n_i}{\alpha + n}$, $i \le K_n$. Recall from Section 13.11.1 that according to 
this rule, the expected number of clusters grows as $\alpha \ln n$. 

An alternative way to arrive at the CRP is as a limiting case (K —> oo) of the finite mixture 
modeling, by adopting a Dirichlet prior with parameter a/K (see, e.g., [23,72]). 

The cluster assignments are exchangeable: An important aspect of the CRP is that the probability 
of a specific clustering does not depend on the order in which customers have arrived and it is unchanged after 
reshuffling the customers, up to a permutation of the table labels (see Problem 13.23). In words, this 
means that in Fig. 13.14, the important information is that, for example, customers #1, #4, #5 
sit at the same table and customer #3, for example, sits on his/her own. The table numbering is of no 
importance. 
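
The exchangeability of the cluster assignments can be verified by brute force on a small example. The Python sketch below is a hedged illustration: it multiplies the sequential seating probabilities implied by the predictive rule (new table with probability proportional to the concentration parameter, occupied table with probability proportional to its occupancy), and the seating arrangement and the function name `crp_probability` are assumptions made for the example.

```python
from itertools import permutations


def crp_probability(tables, alpha):
    """Probability of a seating sequence; tables[i] is the table label
    chosen by the (i + 1)th arriving customer."""
    counts = {}
    prob = 1.0
    for i, table in enumerate(tables):  # i customers are already seated
        if table in counts:
            prob *= counts[table] / (alpha + i)
            counts[table] += 1
        else:
            prob *= alpha / (alpha + i)
            counts[table] = 1
    return prob


# Assumed arrangement: three customers share table "A", two sit alone.
seating = ["A", "B", "A", "A", "C"]
alpha = 1.0
probabilities = [crp_probability(order, alpha) for order in permutations(seating)]
print("min:", min(probabilities), "max:", max(probabilities))
```

Every arrival order gives the same probability (up to floating point rounding), since the product depends only on the table sizes and not on when each customer arrived.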


13.11.3 THE STICK BREAKING CONSTRUCTION OF A DP 

We have already discussed the Polya urn model and the related CRP. The stick breaking 
representation is an alternative representation of a DP and it was developed in [75]. The method builds upon 
Eq. (13.121) and proposes the steps to generate $P_i$, as well as the respective values of $\boldsymbol{\theta}_i$, $i = 1, 2, \ldots$. 

Consider a stick of unit length (Fig. 13.16). The stick is divided into a sequence of infinitely many 
segments of length $P_i$, $i = 1, 2, \ldots$, according to the following algorithm. First, make a draw of a beta 
(Chapter 2) distributed random variable, $\beta_1 \sim \mathrm{Beta}(\beta|1, \alpha)$, and break off a segment of the stick of 
length equal to $\beta_1$; we set $P_1 = \beta_1$. The length of the remaining segment is $1 - \beta_1$. Then make another 
draw, $\beta_2 \sim \mathrm{Beta}(\beta|1, \alpha)$, and break off another segment in proportion to $\beta_2$. Thus, the length of the 
discarded segment is equal to $\beta_2(1 - \beta_1)$ and we set $P_2$ equal to its length. The length of the remaining 
segment is, obviously, $(1 - \beta_1) - \beta_2(1 - \beta_1) = (1 - \beta_1)(1 - \beta_2)$. Following this rationale recursively, 
by breaking off pieces from the remaining segments, and starting at $P_1 = \beta_1 \sim \mathrm{Beta}(\beta|1, \alpha)$, the $i$th 
step of the algorithm is 

$$\beta_i \sim \mathrm{Beta}(\beta|1, \alpha), \tag{13.126}$$

$$P_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j), \quad i = 2, 3, \ldots. \tag{13.127}$$
FIGURE 13.16 

The stick breaking construction of a DP. At each iteration, we sample from a beta distribution $\mathrm{Beta}(\beta|1, \alpha)$ and we 
break off a piece from the remaining segment in proportion to the value of the sample. The first three segments have 
lengths $P_1 = \beta_1$, $P_2 = \beta_2(1 - \beta_1)$, and $P_3 = \beta_3(1 - \beta_2)(1 - \beta_1)$. 


Using the resulting sequence of probability values, $P_i$, a random distribution is formed according to 
Eq. (13.121), with $\boldsymbol{\theta}_i$ drawn i.i.d. from $G_0$. It can be shown that the resulting distribution is a DP, i.e., 
$G \sim \mathrm{DP}(\alpha, G_0)$. 

Note that as i increases, the probability values decrease, because smaller and smaller fractions 
of the stick remain. Thus, out of the infinite possible terms in Eq. (13.121) only a relatively small 
number survive that have significant contribution. Moreover, the probability values associated with 
these surviving terms are the first ones to be generated, which, in practice, is very important when one 
implements a DP to generate a prior (see, e.g., [58]). 
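
A minimal Python sketch of the stick breaking steps is given next. It is an illustration under assumptions that are not part of the construction itself: the infinite sequence is truncated after T pieces (the last piece takes whatever is left of the stick so that the masses add to one), the base distribution is a standard Gaussian, and the function names are hypothetical.

```python
import random


def truncated_stick_breaking(alpha, T, rng):
    """Return T probability masses generated by breaking a unit-length stick."""
    weights = []
    remaining = 1.0
    for _ in range(T - 1):
        beta = rng.betavariate(1.0, alpha)  # beta_i ~ Beta(1, alpha)
        weights.append(beta * remaining)    # P_i = beta_i * prod_j (1 - beta_j)
        remaining *= 1.0 - beta
    weights.append(remaining)               # truncation: last piece is the remainder
    return weights


def draw_dp_realization(alpha, T, rng):
    """Pair the masses with atoms drawn i.i.d. from an assumed N(0, 1) base."""
    weights = truncated_stick_breaking(alpha, T, rng)
    atoms = [rng.gauss(0.0, 1.0) for _ in range(T)]
    return weights, atoms


rng = random.Random(0)
for alpha in (1.0, 100.0):
    weights, atoms = draw_dp_realization(alpha, T=500, rng=rng)
    significant = sum(1 for w in weights if w > 0.01)
    print(f"alpha={alpha}: masses above 0.01 -> {significant}, total mass -> {sum(weights):.6f}")
```

For a small concentration parameter the first few pieces take almost all of the stick, which is why only a few masses are significant.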

A concise tutorial concerning DPs can also be found in [19]. There, a number of sites with publicly 
available software tools are also provided. 

13.11.4 DIRICHLET PROCESS MIXTURE MODELING 

Having established the discrete and clustering properties of a DP, let us now see how it can be used as a 
prior to Bayesian learning of a mixture distribution, e.g., Gaussian mixture modeling. Our framework 
will be that of a nonparametric setting; that is, the number of mixture components is not a priori fixed. 
This forces us to replace priors over a fixed number of parameters, with priors over distributions. The 
Bayesian learning rationale offers the tools for updating the previous prior information to that related 
to posterior distributions, once observations have been obtained. 

Let us assume that we are given a set of observations, $\boldsymbol{x}_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$. We assume that 
each one of them is emitted by a corresponding PDF (distribution) parameterized in terms of a 
respective set of hidden variables, $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_N \in \Theta$. That is, $\boldsymbol{x}_n \sim p(\boldsymbol{x}|\boldsymbol{\theta}_n)$. Observe that the set of 
parameters grows with $N$, in contrast to the fixed number of parameters used in the mixture modeling 
of Section 13.4. In turn, we assume that the parameters are i.i.d. drawn from a distribution, $G$, which 
is itself a draw from a DP (see, e.g., [3]). In summary, the data generation according to this mixture 
model is described as 
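$$G \,|\, (\alpha, G_0) \sim \mathrm{DP}(\alpha, G_0), \tag{13.128}$$

$$\boldsymbol{\theta}_n \,|\, G \sim G, \tag{13.129}$$

$$\boldsymbol{x}_n \,|\, \boldsymbol{\theta}_n \sim p(\boldsymbol{x}|\boldsymbol{\theta}_n). \tag{13.130}$$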

Note that since $G$ is discrete and bears a strong clustering structure, a number of the observation 
samples, $\boldsymbol{x}_n$, will share the same hidden parameters, and this is how the mixture modeling is imposed 
by the DP prior. 

If the stick breaking construction is employed, the above three basic steps are “rephrased,” and the 
data generation mechanism is described as 

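1. Draw the beta variables, $\beta_i \sim \mathrm{Beta}(\beta|1, \alpha)$, $i = 1, 2, \ldots$. 

2. Generate the probability values, $P_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j)$, $i = 2, 3, \ldots$, with $P_1 = \beta_1$. 

3. Draw the corresponding parameters from the base distribution, $\boldsymbol{\theta}_i \sim G_0$, $i = 1, 2, \ldots$. 

4. For all the observations, $\boldsymbol{x}_n$, $n = 1, 2, \ldots, N$, Do: 

• $k_n \sim \mathrm{Cat}(P_1, P_2, \ldots)$. 

• $\boldsymbol{x}_n \,|\, k_n \sim p(\boldsymbol{x}_n|\boldsymbol{\theta}_{k_n})$. 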

The labels $k_n$ comprise the latent variables that indicate the corresponding mixture component (cluster) 
for $\boldsymbol{x}_n$ and they are drawn from a categorical distribution, denoted as Cat. The categorical distribution 
coincides with the multinomial when only one experiment is involved, i.e., $n = 1$ in Eq. (13.122). 
A categorical distribution involves a number of random variables, each one associated with a 
probability value. The draw consists of picking up one of the variables, according to the given probability 
distribution (see, also, Section 2.3). The categorical is sometimes referred to as the multinoulli distribution. 
Concerning the generation of probabilities via the beta distribution, in practice, one selects a value $T$, 
for which we assume that $P_i = 0$, $i > T$. 
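
The four generation steps can be sketched in Python as follows. The sketch makes several assumptions for the sake of a runnable example: the data are one-dimensional, the base distribution is a zero-mean Gaussian with standard deviation `base_std`, the emitting PDF is a Gaussian centered at the drawn parameter with standard deviation `noise_std`, and the stick is truncated at T pieces with the last piece taking the remaining mass. All names are hypothetical.

```python
import random


def generate_dp_mixture(N, alpha, T, rng, base_std=5.0, noise_std=0.5):
    """Generate N scalar observations from a truncated DP mixture."""
    # Steps 1 and 2: beta draws and stick breaking probabilities.
    weights, remaining = [], 1.0
    for _ in range(T - 1):
        beta = rng.betavariate(1.0, alpha)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    weights.append(remaining)  # truncation: remaining mass goes to the last piece

    # Step 3: component parameters from the (assumed Gaussian) base distribution.
    thetas = [rng.gauss(0.0, base_std) for _ in range(T)]

    # Step 4: categorical labels, then observations from the selected component.
    labels = rng.choices(range(T), weights=weights, k=N)
    data = [rng.gauss(thetas[k], noise_std) for k in labels]
    return data, labels, thetas


rng = random.Random(1)
data, labels, thetas = generate_dp_mixture(N=200, alpha=1.0, T=50, rng=rng)
print("observations:", len(data))
print("components actually used:", len(set(labels)))
```

Although T components are available, a small concentration parameter typically leaves most of them unused, which reflects the clustering property of the prior.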

INFERENCE 

The task of inference consists of obtaining the posteriors, given the observations and the functional 
forms of the priors that are associated with the latent variables as well as the involved (hidden) random 
(nondeterministic) parameters. In Section 13.4, the chosen priors were the Gaussian for the mean values 
and the Wishart PDFs for the covariance matrices. As also commented there, at the end of the section, 
the probabilities could also be treated as random variables and adopt a Dirichlet distribution as the 
respective prior. 

In the current setting, because the number of mixtures, $K$, is not preselected, we will adopt a DP 
mixture process model as a prior. Under this modeling, the following latent variables are involved: 

• The variables $\beta_i$, which are associated with the computation of the probabilities $P_i$, $i = 1, 2, \ldots$. 

• The random vectors $\boldsymbol{\theta}_i$, which are used to “place” the preselected form of the PDF, which “emits” 
the observations $p(\boldsymbol{x}|\boldsymbol{\theta})$ in the input space. Often, the functional form of this PDF is chosen so as 
to form a conjugate pair with the base distribution $G_0$. 

• The cluster assignment latent variables $k_n$, $n = 1, 2, \ldots, N$, which are also known as indicator 
variables. As already stated before, sometimes, the indicator variables are written as one-hot vectors, 
$\boldsymbol{k}_n$, with all components being 0 except the one which corresponds to the position that indicates the 
number of the mixture that the $n$th variable is associated with, which is set equal to 1. 


During the learning phase of the posteriors, one has to estimate (a) the parameter $\alpha$ of the beta 
distribution and (b) the parameters that define $G_0$ and $p(\cdot|\cdot)$. As already said, in practice, one assumes that 
$P_i = 0$, $i > T$, for a preselected value of $T$. This is known as the truncated stick breaking 
representation. After all, the maximum number of mixtures one can have is $N$, i.e., the number of points. In the 
worst case, each point belongs to a different cluster. Moreover, we know that the average number of 
clusters grows as $O(\alpha \ln N)$. 

As said, it is common to stay within the exponential family. In words this means that we select the 
following forms for the PDFs involved in the DP model. 

• For the observations’ emitting PDF, 

$$p(\boldsymbol{x}|\boldsymbol{\theta}) = g(\boldsymbol{\theta}) f(\boldsymbol{x}) \exp\left(\boldsymbol{\theta}^T \boldsymbol{x}\right),$$

where we have used the canonical form for the exponential family, as defined in Section 12.8. 

• The base distribution of the DP is chosen as the respective conjugate function, 

$$G_0(\boldsymbol{\theta}; \lambda, \boldsymbol{v}) = h(\lambda, \boldsymbol{v}) \big(g(\boldsymbol{\theta})\big)^{\lambda} \exp\left(\boldsymbol{\theta}^T \boldsymbol{v}\right).$$

With such a choice, the parameters that have to be learned during learning, besides $\alpha$, are $\lambda$ and $\boldsymbol{v}$. 

We are now ready to formulate the task. Collecting all the random variables together, we form the 
corresponding matrix, 

$$W := [\boldsymbol{\beta}, \Theta, \boldsymbol{k}],$$

where $\boldsymbol{\beta}$ is the vector with all stick lengths, $\Theta$ is the matrix comprising all the vectors, $\boldsymbol{\theta}_i$, $i = 
1, 2, \ldots, T$, because we use the truncated stick breaking representation, and $\boldsymbol{k}$ is the $N$-dimensional 
vector of the indicator variables. 

The goal of the inference task is to estimate the joint posterior, $p(W|\mathcal{X})$, where $\mathcal{X}$ is the set of 
the observations, $\mathcal{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$. However, computing $p(W|\mathcal{X})$ turns out to be intractable. One 
way to overcome the associated difficulties is to resort to the mean field approximation (Section 13.2). 
To this end, the following factorized family for the variational approximation of the posterior 
distribution is adopted: 

$$q(W) := q(\boldsymbol{\beta}, \Theta, \boldsymbol{k}) = \prod_{i=1}^{T-1} q_{\gamma_i}(\beta_i) \prod_{i=1}^{T} q_{\lambda_i}(\boldsymbol{\theta}_i) \prod_{n=1}^{N} q_{\phi_n}(k_n). \tag{13.131}$$

Note that the estimates of the posterior PDFs of the $\boldsymbol{\theta}_i$ are governed by variational parameters $\lambda_i$ and 
are members of the exponential family, as discussed before. The estimates of the posteriors for the 
indicator variables are chosen to be multinomials (categorical) with variational parameters $\phi_n$. The 
estimates of the posteriors for the beta variables are chosen to be beta distributions with variational 
parameters $\gamma_i$. Note that the first product involves $T - 1$ factors. This is because of the truncation and 
the fact that probabilities add to one. This necessarily makes $\beta_T = 1$ and, hence, $q(\beta_T) = 1$ (Problem 
13.22). 
The next step is to learn the above posteriors as well as the involved parameters by maximizing the 
lower bound in inequality (12.68), which for our case becomes 

$$\ln p(\mathcal{X}|\boldsymbol{\xi}) \ge \mathcal{F}(q; \boldsymbol{\xi}) = \mathbb{E}_q[\ln p(W, \mathcal{X}|\boldsymbol{\xi})] - \mathbb{E}_q[\ln q(W)],$$

and $\boldsymbol{\xi} := [\alpha, \lambda, \boldsymbol{v}^T]^T$. The algorithm for maximizing the lower bound follows exactly the rationale and 
the steps established in Section 13.2. Note that maximization with respect to $q$ takes place via the 
respective variational parameters, as they are defined in Eq. (13.131). The algorithmic details can be 
found in [11]; see also [38]. The parameters $\boldsymbol{\xi}$ are computed during the M-step of the algorithm. The 
other alternative is to assume that the parameters are in turn random entities, and in this case specific 
priors have to be adopted. Once the learning phase has been terminated, assignment to mixtures is 
performed according to the posteriors $q(k_n)$ that have resulted from the training. 

The alternative path to learn the posteriors is via Monte Carlo sampling techniques, to be discussed 
in Chapter 14. For such cases, the CRP model is particularly convenient for Gibbs sampling 
(Section 14.9); see, for example, [55]. 

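Example 13.5. This example illustrates the computational evolution of the variational inference 
method for a two-dimensional Gaussian mixture model. The data are generated according to five 
separate Gaussian distributions, with parameters 

$$\boldsymbol{\mu}_1 = [-12.5, 2.5]^T, \quad \boldsymbol{\mu}_2 = [-4, -0.4]^T, \quad \boldsymbol{\mu}_3 = [2, -3.5]^T, \quad \boldsymbol{\mu}_4 = [10, 8]^T, \quad \boldsymbol{\mu}_5 = [3, 3]^T$$

and 

$$\Sigma_1 = \begin{bmatrix} 1.4 & 0.81 \\ 0.81 & 1.3 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.5 & 0.2 \\ 0.2 & 2.1 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 1.6 & 1 \\ 1 & 2.9 \end{bmatrix},$$

$$\Sigma_4 = \begin{bmatrix} 0.5 & 0.22 \\ 0.22 & 0.8 \end{bmatrix}, \quad \Sigma_5 = \begin{bmatrix} 1.5 & 1.4 \\ 1.4 & 2.4 \end{bmatrix}$$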

for the means and covariance matrices, respectively. One hundred data points were generated from 
the Gaussian mixture, where each Gaussian was assigned an arbitrary number of points. Fig. 13.17 
depicts the data points as red circles. Variational inference on the model was performed based on the 
MATLAB® implementation of the method in [11]. A data set comprising equidistant and closely 
spaced test points in the area $[-20, 15] \times [-8, 12]$ was used to compute the approximate predictive 
distribution estimated by the variational inference method. The contours of the predictive distribution 
computed during the first, second, and fifth iterations of the algorithm are plotted in Fig. 13.17. The 
algorithm has clearly identified the clusters of the data. 


13.11.5 THE INDIAN BUFFET PROCESS 

We have discussed the clustering promoting nature of a DP, which was exploited in the context of the 
mixture modeling task. The CRP and the stick breaking construction were two paths to represent and 
implement a DP as a prior in practice. 


12 http://sites.google.com/site/kenichikurihara/academic-software. 
FIGURE 13.17 

Contours of predictive distribution for Example 13.5, after (A) the first, (B) the second, and (C) the fifth iteration. 
Observe that finally five clusters have survived. 


We now turn our attention to a different task, which bears a close affinity to mixture modeling. 
Recall that mixture modeling imposes a structure on the observations, by assigning groups of them to 
the same mixture component. In the current section, a different type of structure will be imposed. 

At the heart of the current task lies the assumption that the observations are controlled via a set 
of unobserved latent variables that are stacked together in vectors, known as the feature vectors. The 
basic assumption of the model is that the available observations are linear combinations of the set of 
the unknown latent variables, i.e., 

$$\boldsymbol{x}_n = A \boldsymbol{z}_n, \quad \boldsymbol{x}_n \in \mathbb{R}^l, \quad \boldsymbol{z}_n \in \mathbb{R}^K, \quad n = 1, 2, \ldots, N.$$

Depending on the relative size of $l$ and $K$, we refer to the task with different names. For example, 
if $K < l$ we refer to it as dimensionality reduction, since fewer variables are needed to describe the 
$l$-dimensional observations. 

Collecting all observations together, the previous equation is written as 

$$X = AZ, \quad X \in \mathbb{R}^{l \times N}, \quad Z \in \mathbb{R}^{K \times N}, \tag{13.132}$$

where $X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N]$, $Z = [\boldsymbol{z}_1, \ldots, \boldsymbol{z}_N]$. The task comprises the computation of $A$ and $Z$, given 
the matrix $X$. In general, there are infinitely many solutions to this problem. In practice, one has to 
mobilize further assumptions in order to restrict the possible set of solutions. A number of such external 
constraints are treated in Chapter 19. Depending on the imposed conditions, the task is given a different 
name. In this subsection, we treat the task via nonparametric Bayesian arguments and we are going to 
impose a prior that will play the role of this extra “something” that will make the task well defined. 

The added difficulty is that the number of latent variables, $K$,¹³ is not known a priori. In the mixture 
modeling, when the number, $K$, of mixtures was not known, we left it going to infinity, under a Dirichlet 
prior over the $K$ mixture probabilities; we have already stated that this limit leads to the CRP. In a 


13 We use K to denote the number of latent variables to stress the fact that latent variables here play the role of the mixture 
components in the mixture modeling task. 
similar rationale, we will allow the number of latent variables $K \to \infty$, in order to come up with 
an appropriate prior. Moreover, the prior that will result should have some sparsifying properties, in 
a similar way that while the number of mixtures $K$ was left to grow to infinity, the resulting prior 
promoted the formation of a small number of clusters/mixtures. 

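Since sparsity becomes crucial in our discussion, we will assume that $Z$ comprises zeros and ones. 
Before we go any further, let us elaborate a bit more on the meaning of the zeros. Take as an example the 
case of $N = 5$ and $K = 3$ and make some of the matrix elements zeros. Then the matrix factorization 
in Eq. (13.132) takes the form 

$$[\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_5] = [\boldsymbol{a}_1, \boldsymbol{a}_2, \boldsymbol{a}_3] \begin{bmatrix} z_{11} & z_{12} & z_{13} & 0 & z_{15} \\ z_{21} & 0 & 0 & z_{24} & z_{25} \\ 0 & z_{32} & z_{33} & z_{34} & 0 \end{bmatrix}.$$

The above implies that 

$$\boldsymbol{x}_1 = z_{11}\boldsymbol{a}_1 + z_{21}\boldsymbol{a}_2, \quad \boldsymbol{x}_5 = z_{15}\boldsymbol{a}_1 + z_{25}\boldsymbol{a}_2, \tag{13.133}$$

$$\boldsymbol{x}_2 = z_{12}\boldsymbol{a}_1 + z_{32}\boldsymbol{a}_3, \quad \boldsymbol{x}_3 = z_{13}\boldsymbol{a}_1 + z_{33}\boldsymbol{a}_3, \tag{13.134}$$

$$\boldsymbol{x}_4 = z_{24}\boldsymbol{a}_2 + z_{34}\boldsymbol{a}_3. \tag{13.135}$$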

It does not need much thought to realize that the existence of zeros in $Z$ imposes a structure on the 
observations. Indeed, $\boldsymbol{x}_1$ and $\boldsymbol{x}_5$ lie in the same subspace, spanned by $\boldsymbol{a}_1$ and $\boldsymbol{a}_2$. Observations $\boldsymbol{x}_2$ and 
$\boldsymbol{x}_3$ lie in the subspace that is spanned by $\boldsymbol{a}_1$ and $\boldsymbol{a}_3$, which is different from the previous one. That is, 
the existence of zeros imposes a clustering structure on the input data. 
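
The grouping induced by the zero pattern can be read off mechanically. The short Python sketch below uses the zero pattern of the example with the nonzero entries set to one (an assumption made only for illustration, as is the function name) and groups the observations by the set of columns of A that they use.

```python
from collections import defaultdict


def group_by_support(Z):
    """Group observation indices (1-based) by the set of features (1-based)
    whose entries in the corresponding column of Z are nonzero."""
    K, N = len(Z), len(Z[0])
    groups = defaultdict(list)
    for n in range(N):
        support = tuple(k + 1 for k in range(K) if Z[k][n] != 0)
        groups[support].append(n + 1)
    return dict(groups)


# Zero pattern of the 3 x 5 example; nonzero entries are replaced by 1.
Z = [[1, 1, 1, 0, 1],
     [1, 0, 0, 1, 1],
     [0, 1, 1, 1, 0]]
print(group_by_support(Z))
# prints {(1, 2): [1, 5], (1, 3): [2, 3], (2, 3): [4]}
```

Observations 1 and 5 are grouped together because both are built from the first two columns of A, in agreement with the discussion above.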

Thus, our starting point is to select a prior that promotes zeros on $Z$. Furthermore, we will assume 
that the elements of $Z$ are either 1 or 0, i.e., $z_{kn} \in \{0, 1\}$, $k = 1, 2, \ldots, K$, $n = 1, 2, \ldots, N$. Such a 
simplified treatment will reveal the secrets behind the method. Later on, one can generalize to the more 
practical terrain, where the nonzero values can take real values. For example, one can write $Z' = Z \circ B$, 
where $\circ$ denotes element-wise multiplication, and then impose an extra prior on the values of $B$. 

Searching for a Prior on Infinite Binary Matrices 

The magic words behind the prior we are looking for is that “it should impose a clustering structure on 
the data.” Although this was the goal in the mixture modeling, the problem here is distinctly different. 

• Mixture modeling: Each observation belongs to (is emitted from) a single mixture component. The 
underlying structure results from grouping observations together in a number of different mixture 
components. Observations that are assigned to the same mixture are more “similar” than 
observations that are assigned to different clusters. 

• Matrix factorization: Every observation is expressed as a linear combination of a number of columns 
of matrix $A$. That is, every observation vector is associated with a number of columns of $A$. 
Similarity between observations is established by collecting together observations that are associated with 
the same columns of $A$. 

In a slightly different jargon, the columns of $A$ are known as features and if two observed vectors are 
given as a combination of the same columns, we say that they share the same features. If $z_{kn} = 1$, 
we say that the $k$th feature is present in the $n$th observation. 
For a fixed value of $K$, let $P_k$, $k = 1, 2, \ldots, K$, be the probability that $z_{kn} = 1$, for any value of $n$. 
Then we can write 

$$P(Z|\boldsymbol{P}) = \prod_{k=1}^{K} \prod_{n=1}^{N} P(z_{kn}|P_k) \tag{13.136}$$

$$= \prod_{k=1}^{K} P_k^{m_k} (1 - P_k)^{N - m_k}, \tag{13.137}$$

where $\boldsymbol{P} = [P_1, \ldots, P_K]^T$ and $m_k := \sum_{n=1}^{N} z_{kn}$. That is, $m_k$ is the number of nonzero elements in the 
$k$th row of $Z$; its physical interpretation is that it counts the number of observations that share the $k$th 
feature. Obviously, for Eq. (13.137) to hold true, independence among the involved variables has been 
assumed. 
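
The step from the entrywise product to the form that involves only the row counts can be confirmed numerically. In the following Python sketch the binary matrix and the probabilities are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
import math


def likelihood_entrywise(Z, P):
    """Product over all entries: P_k if z_kn = 1, else (1 - P_k)."""
    prob = 1.0
    for row, p in zip(Z, P):
        for z in row:
            prob *= p if z == 1 else 1.0 - p
    return prob


def likelihood_from_counts(Z, P):
    """Product over rows of P_k^{m_k} (1 - P_k)^{N - m_k}, with m_k the row sum."""
    N = len(Z[0])
    prob = 1.0
    for row, p in zip(Z, P):
        m_k = sum(row)
        prob *= p ** m_k * (1.0 - p) ** (N - m_k)
    return prob


Z = [[1, 1, 0, 1, 0],
     [1, 1, 1, 0, 1],
     [1, 0, 1, 0, 0]]
P = [0.6, 0.8, 0.3]  # assumed feature probabilities
print(likelihood_entrywise(Z, P), likelihood_from_counts(Z, P))
print(math.isclose(likelihood_entrywise(Z, P), likelihood_from_counts(Z, P)))
```

Since the second form depends on each row only through its sum, any rearrangement of the entries within a row leaves the probability unchanged.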


Select a prior for the probabilities: As a prior for $P_k$, $k = 1, 2, \ldots, K$, we adopt the beta distribution, 
i.e., 

$$P_k \sim \mathrm{Beta}(P|a, b).$$

Recall that according to the beta distribution, $P \in [0, 1]$. Note that these $K$ probabilities need not add 
to one. In contrast, as we pointed out in the case of CRP, the prior over the $K$ mixture probabilities 
was taken to be the Dirichlet distribution. This is because they should add to one. Every point was 
necessarily emitted by one among the $K$ mixtures. In contrast, in our current setting, the $k$th 
feature is either shared by an observation, with probability $P_k$, or not, with probability $1 - P_k$. In order 
to get the limiting case, $K \to \infty$, the parameters of the beta distribution are set equal to $a = \frac{\alpha}{K}$ and 
$b = 1$. Such a choice makes the normalizing constant (Chapter 2) equal to 

$$B\left(\frac{\alpha}{K}, 1\right) = \frac{\Gamma\left(\frac{\alpha}{K}\right)\Gamma(1)}{\Gamma\left(\frac{\alpha}{K} + 1\right)} = \frac{K}{\alpha},$$

where the recursive property of the gamma function, $\Gamma(x + 1) = x\Gamma(x)$, has been taken into account. 

Thus, adopting the previous prior, the elements of $Z$ are generated according to the following 
model: 

$$P_k \,|\, \alpha \sim \mathrm{Beta}\left(P \,\Big|\, \frac{\alpha}{K}, 1\right), \tag{13.138}$$

$$z_{kn} \,|\, P_k \sim \mathrm{Bern}(z|P_k), \tag{13.139}$$

where the latter distribution is the Bernoulli distribution (Chapter 2). By the way, recall from Problem 
12.8 that the beta and Bernoulli distributions form a conjugate pair. 
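
The finite model can be simulated with a few lines of Python. The sketch below first checks the value of the normalizing constant for the chosen beta parameters and then draws a binary matrix row by row; the numerical values, the random seed, and the function names are assumptions made for the illustration.

```python
import math
import random


def beta_normalizer(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def sample_binary_matrix(alpha, K, N, rng):
    """Draw a K x N binary matrix: one beta-distributed probability per row,
    then independent Bernoulli entries along that row."""
    probs = [rng.betavariate(alpha / K, 1.0) for _ in range(K)]
    return [[1 if rng.random() < p else 0 for _ in range(N)] for p in probs]


alpha, K, N = 2.0, 50, 10  # assumed values
print("B(alpha/K, 1) =", beta_normalizer(alpha / K, 1.0), " K/alpha =", K / alpha)

rng = random.Random(0)
Z = sample_binary_matrix(alpha, K, N, rng)
active = sum(1 for row in Z if any(row))
print(active, "of", K, "features are present in at least one observation")
```

With the first beta parameter equal to the concentration parameter divided by K, most of the row probabilities are close to zero, so only a few rows of the matrix contain ones; this is the sparsifying behavior that the prior is meant to promote.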


Taking the limit $K \to \infty$: The proof is a bit technical and can be found in [23] and Problem 13.24. 
However, we are going to comment on some of the steps involved in the proof, since these reveal 
some interesting properties. The first step in deriving a prior by taking the limit of the above model is 
to compute the probability $P(Z)$. This is done by marginalizing (integrating) $P_k$ out in Eq. (13.137), 
taking into consideration that they are beta distributed. This results in a $P(Z)$ that is 
inversely proportional to $K$ and hence tends to zero as $K$ tends to infinity. This is natural. If $K \to \infty$, the 
probability of any binary matrix, $Z$, to occur tends to zero. However, here comes a crucial point. What 
we are interested in is not $P(Z)$ but something else! 

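Equivalence classes of binary matrices: The probability $P(Z)$ does not provide the representation 
we are looking for. The reason is that two matrices may be different, yet they can convey the same 
information. For example, assume $N = 8$ and consider the $k$th row of $Z$ to be 

$$\boldsymbol{z}_k^T = [1, 0, 0, 1, 1, 0, 0, 1].$$

Also by definition, 

$$X = AZ = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_k, \ldots, \boldsymbol{a}_K] \begin{bmatrix} \boldsymbol{z}_1^T \\ \vdots \\ \boldsymbol{z}_k^T \\ \vdots \\ \boldsymbol{z}_K^T \end{bmatrix},$$

or focusing only on the contribution of the $k$th feature, we get 

$$X = [\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_8] = \ldots + \boldsymbol{a}_k \boldsymbol{z}_k^T + \ldots.$$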

Taking into account the specific values of the row $\mathbf{z}_k^T$ given before, it is readily seen from the above
that feature $\mathbf{a}_k$ is shared only by the observations $\mathbf{x}_1$, $\mathbf{x}_4$, $\mathbf{x}_5$, and $\mathbf{x}_8$. However, the important attribute
of the above is that the latter four vectors share the same feature. It is of no importance to us whether 
this feature is called k or 1 or 3 or whatever. This is in analogy to the mixture modeling, where one 
does not care if two observations are emitted by the, say, first or second mixture. What is important 
is that they are emitted by the same mixture component. Labeling features is of no importance. The 
crucial information is that a specific group of observations share a number of common features. In a 
more formal way, the structure of X does not depend on the order in which the columns of A and the 
corresponding rows of Z appear. The information related to the clustering structure of X is invariant 
to permutations of A and Z. We are now close to defining the equivalence class concept. 

Consider a specific matrix Z for some fixed value of K. Let us now generate all possible permutations of the rows of Z. Then, as far as the clustering structure of the corresponding matrix X
is concerned (feature sharing), all these permuted matrices are equivalent. We say that they form an
equivalence class. Take, as an example, the following permuted matrices for K = 3 and N = 5:


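$$\begin{bmatrix} 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 \end{bmatrix}.$$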


Observe that both matrices unveil the same structure for the input matrix. The only difference is that
what we call feature #2 in the matrix on the left, we call #1 in the other.

Let us now denote by [Z] the set of all matrices equivalent to Z. What we are really interested in
is to compute the probability P([Z]), for all possible equivalence classes. It turns out that as $K \to \infty$,
P([Z]) remains finite. The secret behind the difference between P(Z) and P([Z]) is that as K keeps
increasing to infinity, the number of possible permutations, and as a consequence the cardinality of
each equivalence class, also increases; this leads to a finite probability for each equivalence class. Note
that each row comprises N zeros and ones. Hence, the maximum number of possible distinct nonzero rows is
$2^N - 1$. Thus, the nonzero rows of Z are formed by a random repetition of these binary numbers. What
characterizes each equivalence class is that all its members are formed by the same binary numbers.
For example, both matrices above consist of the same binary numbers, namely, 11010, 11101, 10100.
The sequence in which these numbers occur is of no importance. We have touched upon all the required
ingredients one needs to derive P([Z]). The proof is given in Problem 13.24.
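
As a small illustration of the equivalence class idea, the following sketch checks whether two binary matrices differ only by a permutation of their rows. The helper name `canonical_form` is hypothetical and introduced only for this illustration; sorting the rows is one simple way (an assumption of this sketch, not necessarily the route taken in [23]) to pick a single representative per class.

```python
import numpy as np

def canonical_form(Z):
    """Return the rows of a binary matrix as a sorted tuple of tuples.

    Two matrices that differ only in the order of their rows map to the
    same canonical form, so they belong to the same equivalence class.
    (Hypothetical helper, used only for this illustration.)
    """
    return tuple(sorted(tuple(int(v) for v in row) for row in Z))

# The two matrices of the K = 3, N = 5 example in the text.
Z_left = np.array([[1, 1, 0, 1, 0],
                   [1, 1, 1, 0, 1],
                   [1, 0, 1, 0, 0]])
Z_right = np.array([[1, 1, 1, 0, 1],
                    [1, 1, 0, 1, 0],
                    [1, 0, 1, 0, 0]])

same_class = canonical_form(Z_left) == canonical_form(Z_right)
print(same_class)
```

Both matrices are built from the binary numbers 11010, 11101, and 10100, so the comparison reports that they are equivalent, while changing any single entry would break the equivalence.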

It turns out that

$$\lim_{K \to \infty} P([Z]) \propto \alpha^{K_+} \exp(-\alpha H_N), \tag{13.140}$$

where $K_+$ is the number of nonzero rows in the class and $H_N = \sum_{j=1}^{N} \frac{1}{j}$ is a constant. The exact form
of the involved constants in the above proportionality relation is of no importance to us (see, e.g., [23]
for details).

Having derived the above prior probability over the class of binary matrices, the issue now is how 
one can sample from or implement such a distribution in practice. In the same vein as with the DPs in 
Section 13.11.2, there are two alternatives. 


Restaurant Construction 

In analogy to the CRP, it turns out that the following metaphor is a way to sample from the prior in
Eq. (13.140). Imagine that the columns of Z correspond to customers and that the rows correspond to
dishes in an infinitely long buffet, which is inspired by the large variety of dishes in a typical Indian
restaurant! This is the reason that the method is known as the Indian buffet process (IBP).

1. The first customer, n = 1, takes $K^{(1)}$ dishes, according to a Poisson distribution with parameter $\alpha$,
i.e.,

$$P\left(K^{(1)}; \alpha\right) = \frac{\alpha^{K^{(1)}} \exp(-\alpha)}{K^{(1)}!}.$$

2. The nth customer takes each dish that has been previously sampled with probability $m_k/n$, where $m_k$
is the number of customers who have already sampled the kth dish (feature). He/she also takes $K^{(n)}$ new
dishes, according to the Poisson distribution with parameter $\alpha/n$, i.e.,

$$P\left(K^{(n)}; \frac{\alpha}{n}\right) = \frac{\left(\frac{\alpha}{n}\right)^{K^{(n)}} \exp\left(-\frac{\alpha}{n}\right)}{K^{(n)}!}.$$

Thus, after N customers have been served, a binary matrix Z of dimension $K_+ \times N$ is formed, with $K_+$ being the total
number of dishes that have been sampled. It can be shown that following such a sampling procedure, as
$K \to \infty$, the probability of any equivalence class P([Z]) is equal to that in Eq. (13.140) (e.g., [23]).
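
The two steps of the restaurant construction can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the function name `sample_ibp` and the chosen values of $\alpha$ and N are hypothetical, and a previously sampled dish k is taken by the nth customer with probability $m_k/n$.

```python
import numpy as np

def sample_ibp(alpha, N, rng):
    """Draw a binary matrix Z (rows = dishes, columns = customers).

    Hypothetical helper: a direct transcription of the two steps of the
    restaurant construction, not an optimized sampler.
    """
    counts = []    # m_k: how many customers have taken dish k so far
    choices = []   # per customer, the indices of the dishes taken
    for n in range(1, N + 1):
        taken = []
        # Previously sampled dishes: dish k is taken with probability m_k / n.
        for k, m_k in enumerate(counts):
            if rng.random() < m_k / n:
                taken.append(k)
        # New dishes: their number follows a Poisson with parameter alpha / n.
        k_new = rng.poisson(alpha / n)
        taken.extend(range(len(counts), len(counts) + k_new))
        counts.extend([0] * k_new)
        for k in taken:
            counts[k] += 1
        choices.append(taken)
    Z = np.zeros((len(counts), N), dtype=int)
    for n, taken in enumerate(choices):
        for k in taken:
            Z[k, n] = 1
    return Z

rng = np.random.default_rng(0)
Z = sample_ibp(alpha=2.0, N=10, rng=rng)
print(Z.shape)   # (K_+, N); K_+ changes from draw to draw
```

Averaging the number of rows of Z over many draws gives a quick sanity check against the mean $\alpha H_N$ of the Poisson distribution that governs $K_+$.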
FIGURE 13.18

The distribution of dishes per customer. Each new customer selects some of the previously selected dishes and some
new ones. Observe that the probability of selecting a new one decreases fast.


It turns out that:


• The number $K_+$ of nonzero rows of Z is also distributed according to a Poisson with parameter
$\alpha H_N$, i.e.,

$$P(K_+; \alpha H_N) = \frac{(\alpha H_N)^{K_+} \exp(-\alpha H_N)}{K_+!}. \tag{13.141}$$

• As $K \to \infty$, matrix Z remains sparse. As a matter of fact, the number of nonzero elements follows
a Poisson distribution with parameter $\alpha N$ and its mean value is $\alpha N$. Furthermore, the probability
of observing a number of nonzero values higher than the mean decreases exponentially. Also, taking into account that the
mean value of a Poisson is equal to its parameter, the average number of dishes selected (features
used) is equal to $\alpha H_N$ (e.g., [23]). That is, for a fixed N, the larger the concentration parameter is,
the more dishes are selected.


In analogy to Fig. 13.14, Fig. 13.18 shows the distribution of dishes per customer for a specific choice
of the parameter $\alpha$. In CRP, each customer is associated with a single table, and the crucial point is
how many customers sit at the same table. In IBP, each customer is associated with multiple dishes and
the crucial point is how many customers select the same dishes. Note that the number of new dishes
keeps decreasing fast. Dishes associated with the first (top) rows are shared by many customers. The
dishes associated with rows toward the bottom of the figure are shared by fewer and fewer customers.

Fig. 13.19 shows the evolution of the number of dishes per customer, for two different concentration
parameters. To avoid confusion, note that, in order to save space in presenting the
figure, customers correspond to rows and dishes to columns (in contrast to Fig. 13.18). The larger the
concentration parameter is, the more dishes (features) are selected, which is in line with what has been
said before in relation to Eq. (13.141).
[Figure panels: Concentration $\alpha = 5$; Concentration $\alpha = 10$. Axis label: Dishes.]

FIGURE 13.19

In this figure, customers are shown in rows and dishes in columns, for graphical convenience. Gray squares correspond to selected dishes. Note that as the concentration parameter increases, more dishes are selected.


Stick Breaking Construction 

As was the case with the CRP, the restaurant construction of an IBP fits nicely when inference is done 
via Monte Carlo sampling, e.g., Gibbs sampling (Chapter 14). However, when variational approxima- 
tion methods are to be used, a stick breaking construction is more appropriate. 

Choose a beta distribution, $\text{Beta}(\beta|\alpha, 1)$. Sample $\beta_1 \sim \text{Beta}(\beta|\alpha, 1)$ and set $P_1 = \beta_1$. Then the kth
step of the algorithm is given by

$$\beta_k \sim \text{Beta}(\beta|\alpha, 1), \tag{13.142}$$

$$P_k = \prod_{j=1}^{k} \beta_j, \quad k = 1, 2, \ldots. \tag{13.143}$$

Fig. 13.20 illustrates the process. Starting with a stick of unit length, we first break a piece of length
$\beta_1$, which we keep, and we set $P_1 = \beta_1$. The remaining part of the stick, of length $\pi_1$, is thrown away.
In the sequel, we break from the kept part a piece of length $\beta_1\beta_2$ and we set $P_2 = \beta_1\beta_2$. We keep this piece and the other
part, of length $\pi_2$, is discarded, and so on. Associating each $P_k$ with the probability of the kth feature
to occur, it can be shown that this construction of the sequence of probabilities is equivalent to an IBP
process [79].

It is interesting to point out that the sequence $\pi_k$ of the discarded segments implements a sequence
related to the DP stick breaking construction given in Eq. (13.127) (Problem 13.25). This parallelism
[Figure labels: $P_1 = \beta_1$, $P_2 = \beta_1\beta_2$, $P_3 = \beta_1\beta_2\beta_3$]

FIGURE 13.20

The stick breaking construction of an IBP. At each iteration, we sample from a Beta($\alpha$, 1) and we keep part of the
segment in proportion to the sample (left part), while we discard the remaining (right) part. We set the probabilities equal to the lengths of the segments that we keep. The lengths $\pi_k$ of the discarded parts correspond to the
probabilities associated with a DP process.

reveals nicely the difference between an IBP and a DP. In an IBP, the sequence of probabilities is a 
decreasing one and the respective values do not add to one. In contrast, the sequence of probabilities in 
a DP is not necessarily decreasing, and the respective values do add to one. 
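
The contrast between the kept and the discarded segments can be checked numerically. The following is a minimal sketch under stated assumptions: the truncation level K and the value of $\alpha$ are arbitrary choices made for the illustration, and the discarded lengths are computed as the difference between successive kept lengths, as the stick picture suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 3.0, 15          # K is a truncation level chosen only for this sketch

beta = rng.beta(alpha, 1.0, size=K)       # beta_k ~ Beta(alpha, 1)
P = np.cumprod(beta)                      # kept lengths: P_k = beta_1 * ... * beta_k

# Discarded lengths: what is thrown away when going from P_{k-1} to P_k (P_0 = 1).
pi = -np.diff(np.concatenate(([1.0], P)))

print(np.round(P, 3))                     # a decreasing sequence
print("sum of kept lengths P_k:", P.sum())        # not constrained to equal one
print("sum of discarded lengths:", pi.sum())      # equals 1 - P_K, approaching one
```

Because each $\beta_k$ lies in (0, 1), the kept lengths can only shrink, whereas the discarded lengths add up to the part of the unit stick that is no longer kept.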

Inference 

For the inference, the data generation model should first be written explicitly. To this end, adopting the
stick breaking construction of an IBP, let us assume that the data follow a Gaussian distribution with a
binary latent feature model; this leads to the following sequence of steps:

1. Generate the beta variables, $\beta_k \sim \text{Beta}(\beta|\alpha, 1)$, k = 1, 2, ....

2. Generate the probability values, $P_k = \prod_{j=1}^{k} \beta_j$, k = 1, 2, 3, ....

3. Populate matrix Z with elements $z_{kn} \sim \text{Bern}(z|P_k)$, n = 1, 2, ..., N and k = 1, 2, ....

4. Generate the features in matrix A, $\mathbf{a}_k \sim \mathcal{N}(\mathbf{0}, \sigma_A^2 I)$, k = 1, 2, ....

5. Generate the observations, $\mathbf{x}_n \sim \mathcal{N}(A\mathbf{z}_n, \sigma_\eta^2 I)$, n = 1, 2, ..., N, where $\mathbf{z}_n$ is the nth column of Z.
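
The five steps above can be sketched as follows. This is only an illustration under stated assumptions: the process is truncated at K features, and the sizes, the value of $\alpha$, and the two standard deviations (`sigma_A`, `sigma_eta`) are arbitrary values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0
K, N, D = 8, 50, 4               # truncation level, observations, data dimension (illustrative)
sigma_A, sigma_eta = 1.0, 0.1    # assumed standard deviations

beta = rng.beta(alpha, 1.0, size=K)                    # step 1: beta_k ~ Beta(alpha, 1)
P = np.cumprod(beta)                                   # step 2: P_k = prod_{j<=k} beta_j
Z = (rng.random((K, N)) < P[:, None]).astype(int)      # step 3: z_kn ~ Bern(P_k)
A = sigma_A * rng.standard_normal((D, K))              # step 4: columns a_k ~ N(0, sigma_A^2 I)
noise = sigma_eta * rng.standard_normal((D, N))
X = A @ Z + noise                                      # step 5: x_n ~ N(A z_n, sigma_eta^2 I)

print(Z.sum(axis=1))   # how many observations share each feature
```

Rows of Z with a small index tend to be denser than later ones, because the probabilities $P_k$ decrease with k.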

The hidden/latent variables are (a) the stick breaking lengths, $\beta_k$, (b) the elements of Z, and (c) the elements of matrix A. In practice, a truncated stick breaking process is considered, where we assume that
$P_k = 0$, k > K. Thus, the matrix of the hidden/latent variables is $W = [\boldsymbol{\beta}, A, Z]$. Given the set of observations, $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the posterior to be maximized with respect to the unknown parameters,
$\alpha$, $\sigma_A^2$, and $\sigma_\eta^2$, is given by



However, it turns out that this is not in a tractable form and the mean field approximation technique
can be mobilized (e.g., [15]). In analogy to Eq. (13.131), the factorized family for the variational
approximation of the posterior is given by

$$q(W) = \prod_{k=1}^{K} q_{\gamma_k}(\beta_k) \prod_{k=1}^{K} q_{\phi_k}(\mathbf{a}_k) \prod_{k=1}^{K} \prod_{n=1}^{N} q_{\nu_{nk}}(z_{nk}),$$

where $q_{\gamma_k}(\beta_k)$ are beta distributions with variational parameters $\gamma_k$, the variational posterior estimates
for the columns of A are Gaussians with parameters $\phi_k$ (i.e., means and covariance matrices), and
$q_{\nu_{nk}}(z_{nk})$ are Bernoulli with parameters (probabilities) $\nu_{nk}$. Inference is carried out by maximizing the
bound in Eq. (12.68). Note that if the sparse matrix is not binary, then it is replaced by $Z \circ B$ and
the elements of B are also latent variables; different priors can be used depending on the task; for
example, it can be Gaussian or Laplacian (see, e.g., [15,23]). In the latter reference, the alternative path
to variational inference via Gibbs sampling techniques is discussed.

Remarks 13.6. 

• In analogy to the CRP, which draws samples according to a DP, it can be shown that the IBP draws
samples according to the so-called beta process [79,88].

• By replacing the parameters ($\alpha$, 1) in the beta distribution with the more general case Beta(a, b), and
changing their values, different distributions result. One such example is the so-called Pitman-Yor
IBP. For such processes, the resulting probabilities decay in expectation following a power law; in
contrast, in the IBP presented before, the decrease is exponentially fast (see, e.g., [62]).


13.12 GAUSSIAN PROCESSES 

In Section 13.11, the way to impose priors onto the model was similar in spirit with that used for para- 
metric modeling techniques; that is, priors were imposed on the set of unknown parameters. In this 
section, a different rationale will be adopted. The prior will be placed directly over the space of nonlin- 
ear functions, rather than specifying a parametric family of nonlinear functions and placing priors over 
their parameters. 

Let us recall the nonlinear regression task given in Eq. (12.1), that is,

$$y = \theta_0 + \sum_{k=1}^{K-1} \theta_k \phi_k(\mathbf{x}) + \eta = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}) + \eta, \tag{13.144}$$

where the parameters $\boldsymbol{\theta}$ are treated as a random vector. Let us define

$$f(\mathbf{x}) = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}),$$

where $f(\mathbf{x})$ is a random process. From Chapter 2, we know that a random process is a random entity
whose realization (the outcome of an experiment) is a function, $f(\mathbf{x})$, instead of a single value. The
idea that spans this section is to work directly on $f(\mathbf{x})$ instead of the indirect approach of modeling it
via the set of parameters, $\boldsymbol{\theta}$. This is not the first time we have adopted such a path. We silently did it in
Chapter 11 while searching for functions in RKHSs. As a matter of fact, this section can be considered
a bridge between the current chapter and Chapter 11.
Recall from Chapter 11 that instead of expanding an unknown function in parameterized form in
terms of a number of preselected basis functions as in Eq. (13.144), we preferred to search directly
for functions that reside in an RKHS; the optimization was carried out with respect to the function
itself (not with respect to a set of parameters). In the context of the squared error loss function, the
optimization was cast as

$$\min_{f \in \mathbb{H}} \sum_{n=1}^{N} \left(y_n - f(\mathbf{x}_n)\right)^2 + C\|f\|^2,$$

where $\|\cdot\|$ denotes the norm in $\mathbb{H}$. The goal in this section is to state the “Bayesian counterpart” to this
approach. To this end, we will focus on a specific family of processes, known as Gaussian processes,
proposed in [56].

Definition 13.1. A random process, $f(\mathbf{x})$, is called a Gaussian process if and only if for any finite number of points, $\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(N)}$, the respective joint probability density function, $p(f(\mathbf{x}_{(1)}), \ldots, f(\mathbf{x}_{(N)}))$,
is Gaussian.

We know that a set of jointly Gaussian distributed random variables is fully described by the respective mean value and the covariance matrix. In a similar spirit, a Gaussian process is fully determined
by its mean value and its covariance function, that is,

$$\mu_{\mathbf{x}} = E[f(\mathbf{x})], \quad \text{cov}_f(\mathbf{x}, \mathbf{x}') = E\left[(f(\mathbf{x}) - \mu_{\mathbf{x}})(f(\mathbf{x}') - \mu_{\mathbf{x}'})\right].$$

A Gaussian process is said to be stationary if $\mu_{\mathbf{x}} = \mu$ and its covariance function is of the form (see
also Chapter 2)

$$\text{cov}_f(\mathbf{x}, \mathbf{x}') = \text{cov}_f(\mathbf{x} - \mathbf{x}').$$

In addition, if $\text{cov}_f(\cdot, \cdot)$ depends only on the magnitude of the distance between $\mathbf{x}$ and $\mathbf{x}'$ (i.e., $\|\mathbf{x} - \mathbf{x}'\|$),
the Gaussian process is called homogeneous. From now on, we will assume $\mu_{\mathbf{x}} = 0$. Before we proceed
further, let us establish another connection with Chapter 11.

13.12.1 COVARIANCE FUNCTIONS AND KERNELS 

For any N and any collection of N points, $\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(N)}$, the respective covariance matrix is defined
by

$$\Sigma = E[\mathbf{f}\mathbf{f}^T],$$

where

$$\mathbf{f} := [f(\mathbf{x}_{(1)}), \ldots, f(\mathbf{x}_{(N)})]^T, \tag{13.145}$$

with elements given by

$$[\Sigma]_{ij} = \text{cov}_f(\mathbf{x}_{(i)}, \mathbf{x}_{(j)}), \quad i, j = 1, 2, \ldots, N.$$

Because $\Sigma$ is a positive semidefinite matrix, this guarantees that the covariance function is a kernel
function (Section 11.5.1). To stress this, from now on we will use the notation

$$\text{cov}_f(\mathbf{x}, \mathbf{x}') = \kappa(\mathbf{x}, \mathbf{x}'),$$
and the covariance matrix becomes the corresponding kernel matrix, denoted as $\mathcal{K}$ (Chapter 11).
This change of notation will make the connections with RKHSs readily spotted. Some typical examples
of kernel functions used for Gaussian processes are:

• Linear kernel:

$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'.$$

Note that this kernel does not correspond to a stationary process.

• Squared exponential or Gaussian kernel:



where h is a parameter determining the length scale of the process. The larger the value of h, the
larger the “statistical” similarity (stronger correlation) of two points having a distance $d = \|\mathbf{x} - \mathbf{x}'\|$
apart.

• Ornstein-Uhlenbeck kernel:



• Rational quadratic kernel:



Recall from Chapter 2, where random processes were first presented, that a stationary covariance
function/kernel has as its Fourier transform the power spectrum of the respective random process; by
definition, the power spectrum of a process is a nonnegative function in the frequency domain. This
suggests a way of constructing kernels for random processes; that is, take the inverse Fourier transform
of a positive function in the frequency domain. Moreover, in principle, all the rules for constructing
kernels, which are discussed in Section 11.5.2, can also be applied to construct covariance functions.
For example, a popular choice of a kernel for a Gaussian process is



where $\theta_1$, $\theta_2$ are hyperparameters, which define the process.

Fig. 13.21A shows examples of different realizations of a stationary Gaussian process using the
Gaussian covariance kernel with h = 2, and Fig. 13.21B for h = 0.2.
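
Realizations such as those described for the two length scales can be generated by sampling a zero mean Gaussian vector whose covariance is the kernel matrix. The sketch below is illustrative only: the parameterization $\exp(-d^2/(2h^2))$ of the Gaussian kernel, the input grid, and the small diagonal jitter added for numerical stability are assumptions of this example.

```python
import numpy as np

def gaussian_kernel(a, b, h):
    """Gaussian kernel on scalar inputs, assuming the form exp(-d^2 / (2 h^2))."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * h**2))

def sample_gp(x, h, n_draws, rng, jitter=1e-6):
    """Draw realizations of a zero mean Gaussian process at the points x."""
    K = gaussian_kernel(x, x, h) + jitter * np.eye(len(x))   # jitter keeps K positive definite
    L = np.linalg.cholesky(K)
    # If u ~ N(0, I), then L u ~ N(0, K).
    return L @ rng.standard_normal((len(x), n_draws))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
smooth = sample_gp(x, h=2.0, n_draws=3, rng=rng)    # slowly varying realizations
rough = sample_gp(x, h=0.2, n_draws=3, rng=rng)     # rapidly varying realizations
```

Plotting the columns of `smooth` and `rough` against `x` shows the qualitative behavior: the shorter the length scale, the faster the realizations vary.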

13.12.2 REGRESSION 

Let us assume that we are given a set of input observations, $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. Recall from Section 12.2 that the main goal in a Bayesian regression task is to obtain the two PDFs,

$$p(\mathbf{y}|\mathcal{X}) \quad \text{and} \quad p(y|\mathbf{x}, \mathbf{y}, \mathcal{X}),$$

FIGURE 13.21

Different realizations of a Gaussian process. Gaussian covariance kernel with (A) h = 2 and (B) h = 0.2. Note that
when the correlation function fades away fast, the graph of the respective realizations shows a fast variation as a
function of the free variable (x).

where

$$\mathbf{y} = \mathbf{f} + \boldsymbol{\eta}, \quad \mathbf{y} := [y_1, \ldots, y_N]^T, \tag{13.146}$$

and

$$y = f(\mathbf{x}) + \eta,$$

and $\mathbf{f}$ is defined in Eq. (13.145). The first of the two PDFs is the joint probability density of the output
variables, which are generated by input points in $\mathcal{X}$; the associated randomness is due to $\mathbf{f}$ as well as
to the noise $\boldsymbol{\eta}$. The second PDF refers to the prediction of the value of y, given the value of $\mathbf{x}$ and the
training data $(y_n, \mathbf{x}_n)$, n = 1, 2, ..., N. We will omit $\mathcal{X}$ to unclutter notation, as we did in Section 12.2.

Assuming $f(\cdot)$ to be a zero mean Gaussian process, $\mathbf{f}$ is jointly Gaussian with zero mean and covariance matrix $\mathcal{K}$, dictated by the covariance function/kernel $\kappa(\cdot, \cdot)$, that is,

$$p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathcal{K}).$$

Also, let $\boldsymbol{\eta}$ be Gaussian of zero mean with covariance matrix $\Sigma_\eta$ and independent of $f(\cdot)$; without harming
generality, let $\Sigma_\eta = \sigma_\eta^2 I$. Thus,

$$p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}|\mathbf{f}, \sigma_\eta^2 I).$$

Then, following exactly the same arguments as in Section 12.2, we obtain

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{0}, \mathcal{K} + \sigma_\eta^2 I). \tag{13.147}$$

This is also obvious from the fact that the sum of two independent Gaussian vectors is also Gaussian
and the mean and covariance matrix can directly be obtained from Eq. (13.146).
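To obtain $p(y|\mathbf{x}, \mathbf{y})$ we can use Eq. (13.147) and apply it recursively. It will also be useful here to bring
into the notation the number of available observations, N, explicitly and write

$$\mathbf{y}_{N+1} := \begin{bmatrix} y \\ \mathbf{y}_N \end{bmatrix}, \quad \mathbf{y}_N := [y_1, \ldots, y_N]^T.$$

From Eq. (13.147), $\mathbf{y}_{N+1}$ follows a Gaussian distribution

$$p(\mathbf{y}_{N+1}) = \mathcal{N}(\mathbf{y}_{N+1}|\mathbf{0}, \Sigma_{N+1}),$$

with

$$\Sigma_{N+1} := \mathcal{K}_{N+1} + \sigma_\eta^2 I_{N+1}.$$

Then, from the Bayes theorem, we have

$$p(y|\mathbf{y}_N) = \frac{p(\mathbf{y}_{N+1})}{p(\mathbf{y}_N)}. \tag{13.148}$$

However, because the joint PDF is Gaussian, the conditional in Eq. (13.148) is also Gaussian. The
respective mean and variance are computed by partitioning the matrix $\Sigma_{N+1}$ (see Appendix of Chapter 12, Eqs. (12.134) and (12.133)) as

$$\Sigma_{N+1} = \begin{bmatrix} \kappa(\mathbf{x}, \mathbf{x}) + \sigma_\eta^2 & \boldsymbol{\kappa}^T(\mathbf{x}) \\ \boldsymbol{\kappa}(\mathbf{x}) & \Sigma_N \end{bmatrix}, \quad \boldsymbol{\kappa}(\mathbf{x}) := [\kappa(\mathbf{x}, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}, \mathbf{x}_N)]^T,$$

i.e.,

$$\mu_y(\mathbf{x}) = \boldsymbol{\kappa}^T(\mathbf{x}) \Sigma_N^{-1} \mathbf{y}_N, \qquad \sigma_y^2(\mathbf{x}) = \sigma_\eta^2 + \kappa(\mathbf{x}, \mathbf{x}) - \boldsymbol{\kappa}^T(\mathbf{x}) \Sigma_N^{-1} \boldsymbol{\kappa}(\mathbf{x}). \tag{13.149}$$

Compare Eq. (13.149) with Eq. (11.27). Taking into account that $\Sigma_N = \mathcal{K}_N + \sigma_\eta^2 I$, $\mu_y(\mathbf{x})$ is identical
to $\hat{y}$ obtained by the kernel ridge regression, for appropriate choices of C and $\sigma_\eta^2$. However, now we
have also obtained information concerning the respective variance of the resulting estimate.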
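
A compact numerical sketch of the predictive mean and variance follows. It is written under stated assumptions: scalar inputs, a Gaussian kernel of the assumed form $\exp(-d^2/(2h^2))$ (so that $\kappa(x, x) = 1$), and hypothetical function names; the matrix $\Sigma_N$ is handled through its Cholesky factor rather than an explicit inverse.

```python
import numpy as np

def gaussian_kernel(a, b, h):
    """Gaussian kernel on scalar inputs, assuming the form exp(-d^2 / (2 h^2))."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * h**2))

def gp_predict(x_train, y_train, x_test, h, noise_var):
    """Posterior mean and variance of y at each test point (hypothetical helper)."""
    Sigma_N = gaussian_kernel(x_train, x_train, h) + noise_var * np.eye(len(x_train))
    k_star = gaussian_kernel(x_train, x_test, h)      # column j holds kappa(x) for test point j
    L = np.linalg.cholesky(Sigma_N)                   # Sigma_N = L L^T
    w = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # Sigma_N^{-1} y_N
    mean = k_star.T @ w
    v = np.linalg.solve(L, k_star)                    # v^T v = kappa^T Sigma_N^{-1} kappa
    var = noise_var + 1.0 - np.sum(v**2, axis=0)      # kappa(x, x) = 1 for this kernel
    return mean, var

x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
mean, var = gp_predict(x_train, y_train, np.array([1.0, 10.0]), h=1.0, noise_var=1e-4)
print(mean, var)
```

At a test point that coincides with a training input the variance is close to the noise level, whereas far from the data the mean falls back to zero and the variance approaches $\sigma_\eta^2 + \kappa(x, x)$.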

At this point, it is interesting to look back at the Bayesian regression task for parametric modeling
in Section 12.2.3, and to remember that the obtained mean value in Eq. (12.20) was the same (for a
zero mean prior $p(\boldsymbol{\theta})$) as that provided by the ridge regression, for an appropriate choice of $\lambda$.

Remarks 13.7. 

• From the previous discussion it is apparent that solving the regression task by resorting to Gaussian
processes is the Bayesian answer to solving a regression task in an RKHS. Both approaches share a
common advantage. Although the underlying mapping to an RKHS (implied by the adopted kernel)
may live in a high-dimensional space, the complexity for solving the task depends on the number of
training points, N. The source of complexity associated with the Gaussian processes is the inversion
of the matrix $\Sigma_N$, which amounts to $O(N^3)$ operations.
• Both equations in (13.149) can be obtained from the corresponding equation derived for the linear
case of Bayesian learning, covered in Section 12.2.3. Indeed, setting $\boldsymbol{\theta}_0 = \mathbf{0}$ in Eq. (12.27) and
combining it with Eq. (12.22), we obtain

$$\mu_y(\mathbf{x}) = \sigma_\theta^2 \mathbf{x}^T X^T \left(\sigma_\eta^2 I + \sigma_\theta^2 X X^T\right)^{-1} \mathbf{y}, \tag{13.150}$$

where X has replaced $\Phi$, because the linear case is treated. Applying now the kernel trick, as discussed in Chapter 11, to replace $\sigma_\theta^2 \mathbf{x}_i^T \mathbf{x}_j$ with a kernel operation $\kappa(\mathbf{x}_i, \mathbf{x}_j)$, one readily obtains the
corresponding equation in Eq. (13.149).

In a similar way, one can obtain $\sigma_y^2(\mathbf{x})$ in Eq. (13.149) from Eq. (12.23) by using Woodbury’s
formula for matrix inversion from Appendix A.1 to reformulate Eq. (12.23) according to Eq. (12.28)
(try it).

Dealing With Hyperparameters 

As we have already stated, the kernel function can be given in terms of some parameters, say, $\boldsymbol{\theta}$,
which in turn have to be estimated from the data. There are various ways to deal with this task. The
first that comes to mind is to optimize the resulting parameterized log-likelihood, $\ln p(\mathbf{y}; \boldsymbol{\theta})$, with
respect to $\boldsymbol{\theta}$, by taking the gradient and equating to zero. Another way is to assume a prior on the
parameters and use Bayesian arguments to integrate them out. The integration is usually intractable
and approximate techniques must be used, for example, Monte Carlo methods (Chapter 14). Needless
to say, both techniques have their drawbacks. Optimizing the log-likelihood is a nonconvex task that
cannot guarantee, in general, a global maximum. On the other hand, Monte Carlo techniques tend to be
computationally intensive, requiring many iterations to converge. More on these issues can be found in
[66].
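
To make the first option concrete, the sketch below evaluates the log-likelihood $\ln p(\mathbf{y}; h)$ implied by Eq. (13.147) for a Gaussian kernel and scans a grid of length scales. Several choices here are assumptions made only for the illustration: the kernel parameterization $\exp(-d^2/(2h^2))$, the synthetic data, the fixed noise variance, and the use of a simple grid search as a stand-in for the gradient-based optimization mentioned in the text.

```python
import numpy as np

def log_marginal_likelihood(x, y, h, noise_var):
    """ln N(y | 0, K + noise_var * I) for a Gaussian kernel with length scale h."""
    d = x[:, None] - x[None, :]
    Sigma = np.exp(-d**2 / (2.0 * h**2)) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    w = np.linalg.solve(L, y)                  # w^T w = y^T Sigma^{-1} y
    log_det_half = np.log(np.diag(L)).sum()    # equals 0.5 * ln|Sigma|
    return -0.5 * w @ w - log_det_half - 0.5 * len(x) * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, 40))
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(40)   # synthetic data, for illustration only

grid = np.logspace(-1.5, 1.0, 30)                     # candidate length scales
scores = np.array([log_marginal_likelihood(x, y, h, 0.01) for h in grid])
h_best = grid[np.argmax(scores)]
print("selected length scale:", h_best)
```

Because the objective is nonconvex, a grid (or several random restarts of a gradient method) is a cheap way to see whether more than one local maximum is present.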

Computational Considerations 

In order to reduce the $O(N^3)$ computational load associated with the inversion of $\Sigma_N$, a number of
approximate techniques have been proposed. A possible path is the sparse Gaussian processes; in
these methods, the full Gaussian process model is approximated by using an expansion in terms of
a finite set of basis functions. For example, it is common to use as bases the set $\kappa(\mathbf{x}, \mathbf{u}_m)$, where
$\mathbf{u}_m$, $m = 1, 2, \ldots, M \ll N$, is a subset of the input samples known as the active set. Such techniques can
lead to a reduced cost of the order of $O(M^2 N)$ (e.g., [65]). Other alternatives that do not require
the active set to be a subset of the training samples have also been proposed (e.g., [40,77]). In [87],
a variational sparse method is proposed that attempts to alleviate problems encountered when one
increases the size of the active set.

A variation of the Gaussian processes approach is to equip it with the ability to forget past samples
for time-varying environments; this method has been proposed in [61,89] as an alternative to the kernel
RLS algorithm discussed in Chapter 11. Other variants use transformations of the output variables to
make Gaussian models applicable to a wider range of problems [41,76].

In [70], the connection between Gaussian processes and Kalman filtering is exploited and the solution is obtained via the involvement of stochastic differential equations, which makes the complexity
depend linearly on time.

Finally, an extended review of related techniques is provided in [43]. 
13.12.3 CLASSIFICATION

In contrast to the regression task, under the Gaussian assumptions for the noise and the involved random process, the classification task gets more involved. In Section 13.7, the logistic regression in its
parametric form, given in Eq. (13.68), was employed. In the context of the Gaussian processes, the
model becomes

$$P(\omega_1|\mathbf{x}) = \frac{1}{1 + \exp(-f(\mathbf{x}))} = \sigma(f(\mathbf{x})),$$

where now $f(\mathbf{x})$ will be treated in terms of a Gaussian random process, associated with a kernel
function $\kappa(\cdot, \cdot)$. Given a set of training samples, $(y_n, \mathbf{x}_n)$, n = 1, 2, ..., N, $y_n \in \{0, 1\}$, and following
the same arguments as in Section 13.7, we can now write

$$P(\mathbf{y}|\mathbf{f}) = \prod_{n=1}^{N} \sigma(f_n)^{y_n} \left(1 - \sigma(f_n)\right)^{1 - y_n},$$

where $f_n := f(\mathbf{x}_n)$, and

$$p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathcal{K}).$$

Note that $P(\mathbf{y}|\mathbf{f})$ is no longer Gaussian and the involved integrations needed to obtain $P(\mathbf{y})$ and/or
$P(y|\mathbf{x}, \mathbf{y})$ cannot be performed analytically. There are various ways to perform approximations. One
path is to resort to the Laplacian approximation of $p(f(\mathbf{x})|\mathbf{y})$ (see Section 12.3) [90]. Another is to
use Monte Carlo techniques [53]. In [20], a variational approach has been used to obtain bounds on the
logistic sigmoid and approximate the respective product with a product of Gaussians. The expectation
propagation method has been used in [57].

For further reading on Gaussian processes, the interested reader may consult the classical reference
[66].

Example 13.6. The goal of this example is to demonstrate the usage of Gaussian processes in regression. To this end, N = 20 points were randomly sampled from a realization of a Gaussian process, with
zero mean and covariance function based on the Gaussian kernel with length scale h = 0.5. The corresponding input points were drawn according to a normal distribution of zero mean and unit variance.
In the sequel, Gaussian noise was added to these Gaussian process points, with variance 0.01, to form
the set of observed data (shown as “+” in Fig. 13.22). Using these as the training data, predictions of
the output variables, corresponding to D = 1000 equidistant input points in the interval [−3, 4], were
performed; for the prediction, the expressions for the posterior Gaussian process mean and variance
in Eq. (13.149) were used. The mean of the posterior Gaussian process is illustrated in Fig. 13.22 as
a solid red line. The shaded area surrounding the curve of the posterior mean corresponds to the error bars $\mu_y \pm 2\sigma_y$ of the posterior prediction. Note the increase of the posterior prediction variance in
regions where observed data points are scarce.

FIGURE 13.22

The red line corresponds to the mean of the posterior Gaussian process. The shaded area corresponds to ± twice the
standard deviation.


13.13 A CASE STUDY: HYPERSPECTRAL IMAGE UNMIXING 

Hyperspectral image unmixing (HSI) is a typical application of sparse regression modeling under a set 
of constraints. It is a good “excuse” for us to demonstrate the application of the hierarchical Bayesian 
modeling approach via a task of great practical importance. 

In hyperspectral remote sensing, the electromagnetic solar energy emanating from the earth’s surface is measured by sensitive scanners located aboard a satellite, an aircraft, or a space station. The
scanners are sensitive to a number of wavelength bands of the electromagnetic radiation. Different
properties of the earth’s surface contribute to the reflection of the energy in the different bands. For example, in the visible-infrared range, properties such as the mineral and moisture contents of soils, the
sedimentation of water, and the moisture content of vegetation are the main contributors to the reflected
energy. In contrast, at the thermal end of the infrared, it is the thermal capacity and thermal properties of the surface that contribute to the reflection. Thus, each band measures different properties of
the same patch of the earth’s surface. In this way, images of the earth’s surface corresponding to the
spatial distribution of the reflected energy in each band can be created. The task now is to exploit this
information in order to identify the various ground cover types, that is, built-up land, agricultural land,
forest, fire burn, water, diseased crop, and so on.

Fig. 13.23 illustrates the process of generating a pixel’s spectral signature out of a hyperspectral image data cube (the cube consists of two spatial and one spectral dimension). Each image corresponds
to a single wavelength (band) and each pixel to a specific patch of the earth’s surface. The spectral signature of a pixel is simply a vector containing radiance values measured in the various spectral bands.
Technological advances in recent years have allowed the implementation of imaging spectrometers,
which have the ability to collect data in hundreds of adjacent spectral bands. The highly increased volume of data conveys spatial/spectral information that can be properly exploited to accurately determine
the type and nature of the objects being imaged.

An inherent limitation of hyperspectral remote sensing is that a single pixel often records a mixed
spectral signature of different distinct materials, due to the low spatial resolution of the remote sensor.
This raises the need for spectral unmixing (SU) [37], which is a very important step in hyperspectral
image processing that has recently attracted strong scientific interest. SU is the procedure of decomposing the measured spectrum of an observed pixel into a collection of constituent spectral signatures (or
endmembers) and their corresponding proportions (or abundances). A widely used model to perform
SU is the linear mixing model.

Assume a remotely sensed hyperspectral image consisting of M spectral bands, and let $\mathbf{y} \in \mathbb{R}^M$ be
the vector containing the measured spectral signature (i.e., the radiance values in all spectral bands) of
a single pixel (specific earth patch). Also let $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_l]$ stand for the $M \times l$ endmember signature matrix, where $\mathbf{x}_i \in \mathbb{R}^M$, $i = 1, 2, \ldots, l$, comprises the spectral signature of the ith endmember,
and l is the total number of (possible) distinct endmembers (earth surface/material types) present in the
scene. Finally, let $\boldsymbol{\theta} = [\theta_1, \theta_2, \ldots, \theta_l]^T$ be the abundance vector associated with $\mathbf{y}$, where $\theta_i$ denotes
the abundance fraction of $\mathbf{x}_i$ in $\mathbf{y}$. The linear mixing model assumes that there is a linear relationship
between the spectra of the measured pixel and the endmembers, expressed as

$$\mathbf{y} = X\boldsymbol{\theta} + \boldsymbol{\eta}, \tag{13.151}$$

where $\boldsymbol{\eta}$ stands for the additive noise values, which are assumed to be samples of a zero mean Gaussian
distributed random vector, with independent and identically distributed (i.i.d.) elements, that is, $\boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{\eta}|\mathbf{0}, \beta^{-1} I_M)$, where $\beta$ denotes the
inverse of the noise variance (precision), and $I_M$ is the $M \times M$ identity matrix. Note that the model
in Eq. (13.151) is a typical regression model in its multivariate formulation, because now the output
for each measurement is a vector and not a scalar (see also Section 4.9 of Chapter 4). The output


[Figure axis: Wavelength, nm (from 400 to 2500)]

FIGURE 13.23

Each image corresponds to a specific wavelength band and each pixel to a particular patch of the earth’s surface.
The signature of a pixel is a vector whose coefficients measure the radiance of the respective patch of the earth in
the different bands (modified image taken from [69]).
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variables are measured and the matrix X is assumed known, and indeed there are methods to estimate 
its elements. 

Treating such a model to recover the abundance coefficients would be a straightforward application 
of what has been said so far in the current and previous chapters of this book. However, there is a 
physical constraint that has to be considered and that makes the task more interesting. The abundance 
coefficients are nonnegative, that is, 

$$\theta_i \geq 0, \quad i = 1, 2, \ldots, l. \tag{13.152}$$

Additionally, a valid assumption is that only a few of the endmembers present in the image will
contribute to the spectrum of a single pixel $y$. In other words, the abundance vector $\theta$ accepts a sparse
representation in $X$.
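To make the setting concrete, the following minimal sketch simulates a single pixel from the linear mixing model of Eq. (13.151) with a nonnegative, sparse abundance vector. The sizes, the random endmember matrix, the precision value, and the helper name `mix_pixel` are all assumptions chosen for illustration; they are not taken from the experiments reported later.

```python
import random

def mix_pixel(X, theta, beta, rng):
    """Return y = X theta + eta, with eta_m ~ N(0, 1/beta) drawn i.i.d."""
    sigma = (1.0 / beta) ** 0.5  # noise standard deviation from the precision beta
    return [
        sum(x_mi * t_i for x_mi, t_i in zip(row, theta)) + rng.gauss(0.0, sigma)
        for row in X
    ]

rng = random.Random(0)
M, l = 6, 4                      # toy sizes: 6 bands, 4 candidate endmembers
# Hypothetical endmember signatures (rows are bands, columns are endmembers).
X = [[rng.random() for _ in range(l)] for _ in range(M)]
theta = [0.7, 0.0, 0.3, 0.0]     # nonnegative and sparse abundances
y = mix_pixel(X, theta, beta=1e4, rng=rng)
print(y)
```

Only two of the four columns of `X` contribute to `y`; the unmixing task is to recover which ones, and in what proportion, from `y` and `X` alone.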

Thus, our goal is to estimate $\theta$ subject to the nonnegativity as well as the sparsity constraints, given
the spectral measurements, y, and the endmember matrix, X. Obviously, there are different paths to 
achieve this goal. Because we are currently exploring the Bayesian world, we will employ the Bayesian 
framework. To this end, an appropriate prior model that expresses our prior belief on the parameters of 
interest will first be adopted, and we will then perform Bayesian inference using the variational Bayes 
methodology, as has been previously discussed. 

13.13.1 HIERARCHICAL BAYESIAN MODELING 

The presence of Gaussian noise in Eq. (13.151) dictates that

$$p(y|\theta, \beta) = \mathcal{N}(y|X\theta, \beta^{-1} I_M)
= (2\pi)^{-\frac{M}{2}} \beta^{\frac{M}{2}} \exp\left(-\frac{\beta}{2}\|y - X\theta\|^2\right). \tag{13.153}$$

We now turn our attention to selecting suitable priors for the model parameters, which are treated as
random variables, $\theta$, $\beta$. As a prior for the nonnegative noise precision $\beta$ we adopt a Gamma distribution
(Section 13.3, Eq. (13.25)), expressed as

$$p(\beta) = \mathrm{Gamma}(\beta|c, d) = \frac{d^c}{\Gamma(c)} \beta^{c-1} \exp(-d\beta), \tag{13.154}$$

where $c$ and $d$ are the respective parameters (set equal to $10^{-6}$ in the experiments).

For the abundance vector $\theta$, we define a two-level hierarchical prior that is expressed in a conjugate
form and imposes sparsity as well as nonnegativity on the abundance coefficients. Inspired by [68],
a nonnegatively truncated Gaussian prior is selected, i.e.,

$$p(\theta|\alpha) = \mathcal{N}_{\mathbb{R}^l_+}(\theta|0, A^{-1}), \tag{13.155}$$

where $\alpha := [\alpha_1, \alpha_2, \ldots, \alpha_l]^T$ is the precision parameter vector, $A = \mathrm{diag}\{\alpha_1, \ldots, \alpha_l\}$ is the
corresponding diagonal matrix, and $\mathcal{N}_{\mathbb{R}^l_+}$ signifies the $l$-variate normal distribution truncated at the
nonnegative orthant of $\mathbb{R}^l$, denoted by $\mathbb{R}^l_+$ [81]. In the second level of hierarchy, the precision
parameters are also considered random variables, $\alpha_i$, $i = 1, 2, \ldots, l$, that follow an inverse Gamma
distribution, that is,

$$p(\alpha_i) = \mathrm{IGamma}\left(\alpha_i \Big| 1, \frac{b_i}{2}\right) = \frac{b_i}{2}\,\alpha_i^{-2} \exp\left(-\frac{b_i}{2\alpha_i}\right), \tag{13.156}$$

where $b_i$, $i = 1, 2, \ldots, l$, are scale hyperparameters. These two levels of hierarchy form a nonnegatively
truncated multivariate Laplace prior over the abundance vector $\theta$, which can be established by
integrating out the precision $\alpha$ [81], that is,

$$p(\theta|b) = \prod_{i=1}^{l} \sqrt{b_i}\, \exp\left(-\sqrt{b_i}\,|\theta_i|\right) I_{\mathbb{R}^l_+}(\theta), \tag{13.157}$$

where $I_{\mathbb{R}^l_+}(\theta)$ is the indicator function, with $I_{\mathbb{R}^l_+}(\theta) = 1$ (resp. 0) if $\theta \in \mathbb{R}^l_+$
(resp. $\theta \notin \mathbb{R}^l_+$). In our formulation, the sparsity promoting scale hyperparameters in Eq. (13.156)
are also assumed to be random and are inferred from the data, by assuming the following Gamma prior
distribution for each $b_i$, $i = 1, 2, \ldots, l$:

$$p(b_i) = \mathrm{Gamma}(b_i|\kappa, \nu) = \frac{\nu^{\kappa}}{\Gamma(\kappa)}\, b_i^{\kappa-1} \exp(-\nu b_i). \tag{13.158}$$

Hyperparameters $\kappa$ and $\nu$ in Eq. (13.158) are also set to small values ($10^{-6}$ in the experiments).

Having adopted the hierarchical Bayesian model, the variational EM algorithm discussed in
Section 13.3 is applied with the goal of obtaining estimates, $q(\theta_i)$, $i = 1, 2, \ldots, l$, of the posteriors of the
abundance parameters given the observations. In the experiments, the respective mean values of $q(\theta_i)$
will be used as estimates of the unknown parameter values. Details on the derivation can be obtained
from [82]. The alternative path to the variational EM algorithm is to employ Monte Carlo techniques
(see, e.g., [14]).


13.13.2 EXPERIMENTAL RESULTS 

The previously described model was applied to a real hyperspectral image, collected by the Airborne 
Visible/Infrared Imaging Spectrometer (AVIRIS) over a Cuprite mining district in Nevada in the sum- 
mer of 1997. The Cuprite data set has been extensively used to evaluate remote sensing technologies 
and spectral unmixing algorithms (e.g., [30,51,81]). It comprises 224 spectral bands in the range from 
400 to 2500 nanometers. A subimage of the Cuprite data set with size 250 x 191 pixels is used in our 
experiments. Fig. 13.24 displays a composite of our image, where bands 183, 193, and 203 have been 
used. 

After removing some low signal-to-noise ratio (SNR) bands and water vapor absorption bands, M = 
188 spectral bands remain available for processing. As a preprocessing step, the VCA algorithm has 
been used to extract 14 endmembers from our hyperspectral image, as in [51]. The vertex component 
analysis (VCA) algorithm identifies the signatures of the “pure” pixels in the image and considers
them pure material signatures. A plot of the spectral signatures of the extracted endmembers versus the
wavelength is displayed in Fig. 13.25.

14 The data are publicly available at http://aviris.jpl.nasa.gov/data/free_data.html.

15 The VCA code is available at http://www.lx.it.pt/~bioucas/code.htm.

FIGURE 13.24

Composite of the AVIRIS Cuprite subimage using bands 183, 193, and 203 (from [69]). The full RGB color image
is available from the site of this book.

Fig. 13.26 shows the resulting abundance maps for six different endmembers, using the variational 
Bayes method. A dark (resp. light) pixel reveals a low (resp. high) proportional percentage for the 
respective endmember in that pixel. In other words, each image shows the distribution of values of a 
specific abundance coefficient, $\theta_i$, over the sensed earth surface.

More important, we are able to identify the presented endmembers in Fig. 13.26 as muscovite, 
alunite, buddingtonite, montmorillonite, kaolinite 1, and kaolinite 2. 


PROBLEMS 

13.1 Show Eq. (13.5). 

13.2 Show Eq. (13.38). 

13.3 Show Eqs. (13.43)-(13.45). 
FIGURE 13.25

Spectral signatures of six out of the 14 endmembers extracted from the Cuprite image using the VCA algorithm 
[51]. A figure showing all 14 signatures can be downloaded from the site of this book. 


13.4 Show that if

$$p(x) \propto \frac{1}{x},$$

then the random variable $z := \ln x$ follows a uniform distribution.

13.5 Derive the lower bound after convergence of the variational Bayesian EM for the linear regres- 
sion task, which is modeled in Section 13.3. 

13.6 Consider the Gaussian mixture model

$$p(x) = \sum_{k=1}^{K} P_k\, \mathcal{N}(x|\mu_k, Q_k^{-1}),$$

with priors

$$p(\mu_k) = \mathcal{N}(\mu_k|0, \beta^{-1} I) \tag{13.159}$$

and

$$p(Q_k) = \mathcal{W}(Q_k|\nu_0, W_0).$$

Given the set of observations $\mathcal{X} = \{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^l$, derive the respective variational
Bayesian EM algorithm, using the mean field approximation for the involved posterior PDFs.
Consider $P_k$, $k = 1, 2, \ldots, K$, as deterministic parameters and optimize the respective lower
bound of the evidence with respect to the $P_k$s.

13.7 Consider the Gaussian mixture model of Problem 13.6, with the following priors imposed on
$\mu$, $Q$, and $P$:

$$p(\mu, Q) = p(\mu|Q)\,p(Q) = \prod_{k=1}^{K} \mathcal{N}\left(\mu_k|0, (\lambda Q_k)^{-1}\right) \mathcal{W}(Q_k|\nu_0, W_0),$$

that is, a Gaussian-Wishart product, and

$$p(P) = \mathrm{Dir}(P|a) \propto \prod_{k=1}^{K} P_k^{a-1},$$

that is, a Dirichlet prior. Thus, $P$ is treated as a random vector. Derive the E algorithmic steps of
the variational Bayesian approximation, adopting the mean field approximation for the involved
posterior PDFs. We have adopted the notation $\mu$ in place of $\mu_{1:K}$ and $Q$ in place of $Q_{1:K}$ for
notational simplicity.

FIGURE 13.26

Estimated abundance maps for the materials (A) muscovite, (B) alunite, (C) buddingtonite, (D) montmorillonite,
(E) kaolinite 1, and (F) kaolinite 2. The full-color image is available from the site of this book.

13.8 If $\mu$ and $Q$ are distributed according to a Gaussian-Wishart product,

p(p, Q) = &Qr l )W(Q\v, W), 
compute the expectation

$$\mathbb{E}[\mu^T Q \mu].$$

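13.9 Derive the Hessian matrix with respect to $\theta$ of the cost function

$$J(\theta) = \sum_{n=1}^{N} \left[ y_n \ln \sigma\left(\phi^T(x_n)\theta\right) + (1 - y_n) \ln\left(1 - \sigma\left(\phi^T(x_n)\theta\right)\right) \right] - \frac{1}{2}\theta^T A \theta,$$

where

$$\sigma(z) = \frac{1}{1 + \exp(-z)}.$$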


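13.10 Show that the marginal of a Gaussian PDF with a gamma prior on the variance, after integrating
out the variance, is the Student's t PDF, given by

$$\mathrm{st}(x|\mu, \lambda, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \frac{1}{\left(1 + \frac{\lambda(x-\mu)^2}{\nu}\right)^{\frac{\nu+1}{2}}}. \tag{13.160}$$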


13.11 Derive the pair of recursions Eqs. (13.62)-(13.63).

13.12 Consider a two-class classification task and assume that the feature vectors in each one of the
two classes, $\omega_1$, $\omega_2$, are distributed according to the Gaussian PDF. Both classes share the same
covariance matrix $\Sigma$, and the mean values are $\mu_1$ and $\mu_2$, respectively. Prove that, given an
observed feature vector, $x \in \mathbb{R}^l$, the posterior probabilities for deciding in favor of one of the
classes are given by the logistic function, i.e.,

$$P(\omega_2|x) = \frac{1}{1 + \exp\left(-\theta^T x + \theta_0\right)},$$

where

$$\theta := \Sigma^{-1}(\mu_2 - \mu_1)$$

and

$$\theta_0 = \frac{1}{2}(\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 + \mu_1) + \ln\frac{P(\omega_1)}{P(\omega_2)}.$$

13.13 Derive Eq. (13.74). 

13.14 Show Eq. (13.75). 

13.15 Derive the recursion Eq. (13.77). 

13.16 Show that if $f$ is a convex function, $f : \mathbb{R}^l \longrightarrow \mathbb{R}$, then it is equal to the conjugate of its
conjugate, i.e., $(f^*)^* = f$.
13.17 Prove that

$$f(x) = \ln\frac{\lambda}{2} - \lambda\sqrt{x}, \quad x \geq 0,$$

is a convex function.

13.18 Derive variational bounds for the logistic regression function

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

one of them in terms of a Gaussian function. For the latter case, use the transformation $t = \sqrt{x}$.

13.19 Prove Eq. (13.100).

13.20 Derive the mean and variance of $G(T_k)$ for a DP process.

13.21 Show that the posterior DP, after having obtained n observations from the set 0, is given by 


G |0i,...,0„~DP a- 




13.22 The stick breaking construction of a DP is built around the following rule: $P_1 = \beta_1 \sim
\mathrm{Beta}(\beta|1, \alpha)$ and

$$\beta_i \sim \mathrm{Beta}(\beta|1, \alpha), \tag{13.161}$$

$$P_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j), \quad i \geq 2. \tag{13.162}$$

Show that if the number of steps is finite, i.e., we assume that $P_i = 0$, $i > T$, for some $T$, then
$\beta_T = 1$.

13.23 Show that in CRP, the cluster assignments are exchangeable and do not depend on the sequence
in which customers arrive, up to a permutation of the labels of the tables.

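13.24 Show that in an IBP, the probabilities for $P(Z)$ and the equivalence classes $P([Z])$ are given
by the formulas

$$P(Z) = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\, \Gamma\left(m_k + \frac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\left(N + 1 + \frac{\alpha}{K}\right)}$$

and

$$P([Z]) = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!} \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\, \Gamma\left(m_k + \frac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\left(N + 1 + \frac{\alpha}{K}\right)},$$

respectively. Note that $K_h$, $h = 1, 2, \ldots, 2^N - 1$, is the number of times the row vector associated
with the $h$th nonzero binary number appears in $Z$.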

13.25 Show that the discarded pieces, $\pi_k$, in the stick breaking construction of an IBP are equal to the
sequence of probabilities produced in a DP stick breaking construction.

MATLAB® EXERCISES

13.26 Generate $N = 60$ data points from each of the five Gaussian distributions given in
Example 13.1. Implement the EM algorithm to obtain estimates of the parameters of the Gaussian
mixture model (Exercise 12.18). Run the EM algorithm on our generated data, assuming
$K = 25$ clusters, using randomly chosen values for the initial mean values and the covariance
matrices. Next, implement the variational Bayes algorithm that treats the same problem,
according to the steps reported in Section 13.4. Plot the initial and final estimates of the EM and
the variational Bayes algorithm to reproduce the results of Fig. 13.4. Play with different values
of the parameters.

13.27 Generate a vector comprising $N = 100$ equidistant sampling points $x_n$ in the interval $[-10, 10]$.
Compute $N$ basis functions, each one located at a sampling point $x_n$, of the form $\phi_n(x) =
\exp\left(-(x - x_n)^2 / 2\sigma_\phi^2\right)$, where $\sigma_\phi^2 = 0.1$. Select two of the basis functions randomly to compute
the output samples, $y_n$, according to the regression model of Example 13.2. The additive noise
power should correspond to an SNR level of 6 dB. Implement the EM algorithm expressed in
Eqs. (12.43), (12.44), (12.51), and (12.52), in order to fit a (generalized) linear regression model
comprising the $N$ basis functions to the generated data $y_n$. Also, implement the variational
Bayes EM, summarized in Algorithm 13.1. Plot the reconstructed signals and compare the
results.

13.28 Generate $N = 150$ two-dimensional data points $x_n$, uniformly distributed in the region
$[-5, 5] \times [-5, 5]$. Assign a binary label to each $x_n$, depending a) on which side of the graph of
the function

$$f(x) = 0.05x^3 + 0.05x^2 + 0.05x + 0.05$$

the point lies and b) on the value of a noise variable. To generate the training data, for each
sample $x_n = [x_{n1}, x_{n2}]^T$, compute

$$y_n = 0.05x_{n1}^3 + 0.05x_{n1}^2 + 0.05x_{n1} + 0.05 + \eta,$$

where $\eta$ stands for zero-mean Gaussian noise of variance $\sigma_\eta^2 = 4$. If $x_{n2} \geq y_n$ assign $x_n$ to $\omega_1$,
otherwise assign it to class $\omega_2$. Download and run the MATLAB code of the RVM classifier (16)
for the generated data set. Use the Gaussian kernel with $\sigma^2 = 3$. Repeat the experiments with
different values of $\sigma^2$. Plot the points $x_n$ using different colors for each class. Plot the obtained
decision curves (classifier) and discuss the results.

16 The RVM software can be found at http://www.miketipping.com/sparsebayes.htm.

13.29 Download the MATLAB® code for the CRP mixture model from
http://sites.google.com/site/kenichikurihara/academic-software. Generate two-dimensional data from the
Gaussian mixture model of Example 13.5 and reproduce the results in Fig. 13.17.

13.30 Consider a one-dimensional Gaussian process with zero mean and Gaussian (kernel) covariance
function, with length scale $h = 0.5$.

(a) Sample $D = 100$ equidistant input points in the interval $[-2, 2]$. Use these as input points
to compute the covariance function of the Gaussian process and form the respective
$100 \times 100$ covariance matrix. Use the corresponding multivariate Gaussian to generate
samples for five different realizations and plot the results, as in Fig. 13.21. Repeat the
same experiment with different values for the parameter $h$.

(b) Now, sample $N = 20$ input points from a zero mean, unit-variance normal distribution.
Based on these input points, evaluate the covariance function and the respective $20 \times 20$
covariance matrix, as before. Then, generate noisy Gaussian process data, by first
sampling $N$ points from our Gaussian process, and then adding zero mean Gaussian noise
with variance 0.1. Next, sample $D = 100$ points in the interval $[-3, 4]$. Compute the
corresponding mean and the variance of the predictive Gaussian process, as given in
Eq. (13.149). In a single figure, plot the observed data, the posterior mean, and the
error bars of the predictive mean, as in Fig. 13.22.

13.31 Reproduce the hyperspectral unmixing results of Fig. 13.26 by running the script “HSIvB.m,”
which is available at the website of the book.

14.1 INTRODUCTION

In Chapters 12 and 13, the Bayesian inference task was considered. A large part of the latter chapter was 
dedicated to dealing with approximation techniques, which offered escape routes when the involved 
PDFs were complex enough to render integral computations intractable. All these techniques were
of a deterministic nature; that is, the goal was to approximate the mathematical expression of the 
corresponding PDF by another one that could ease the associated calculations. Such methods include 
the Laplacian approximation, as well as the variational methods based on the mean field theory or 
the convex duality concept. Deterministic approximation methods will also be used for approximate 
inference in Chapter 15, to deal with graphical models. 

In this chapter, we turn our attention to approximation methods with a much stronger statistical 
flavor, which are based on randomly generating samples using numerical techniques; these samples 
are typical of an underlying distribution, which may be of either continuous or discrete nature. This 
is an old field, with origins tracing back to the late 1940s and early 1950s in the pioneering work of 
Stanislaw Ulam, John von Neumann, and Nicholas Metropolis in Los Alamos, when the term Monte
Carlo was coined as an umbrella name of such techniques, inspired by the famous casino in Monaco
(see, e.g., [28] for a historical note). The first application of such techniques, which coincided with
the development of the first computers, was in the context of the Manhattan project for developing 
the hydrogen bomb; soon after, Monte Carlo methods were embraced by almost every scientific area 
where statistical computations are involved. 

As is often the case with pioneering ideas, when they are looked at a posteriori, that is, once they 
have been stated, the basic idea seems simple. Our current task of interest is the computation of an 
integral, which involves a PDF; this can alternatively be interpreted as the computation of an “expecta- 
tion.” Such a view provides the permit to approximate the integral as the sample mean of the involved 
quantities, given a sufficient number of samples and exploiting the law of large numbers. 

To condense a field with a history of a number of decades in a single chapter is obviously impos- 
sible. Our goal is to present the basic concepts, definitions, and directions, with the aim of serving the 
needs associated with typical machine learning tasks rather than looking at it as an entity on its own. 

We start with the more classical methods using transformations and then move on to the rejection 
and importance sampling techniques. In the sequel, the more powerful methods based on arguments 
from the theory of Markov chains are reviewed. The Metropolis-Hastings and Gibbs sampling meth¬ 
ods are presented and discussed. Finally, a case study concerning the change-point detection task is 
considered. 


14.2 MONTE CARLO METHODS: THE MAIN CONCEPT 

Our starting point is the evaluation of integrals of the form

$$\int_{-\infty}^{\infty} f(x)p(x)\,dx, \tag{14.1}$$

where $x \in \mathbb{R}^l$ is a random vector and $p(x)$ is the corresponding distribution.$^1$ Our interest lies in cases
where the forms of $f(x)$ and/or $p(x)$ are such that the evaluation of such integrals is intractable.
For example, such integrations occur in the evaluation of the evidence function (Eq. (12.14)), in the
prediction task (Eq. (12.18)), and in the E-step of the EM algorithm (Eq. (12.40)). In Eq. (12.14), the
random variable is the parameter vector $\theta$ and $f(\theta) = p(y|\theta)$.

1 In the case of discrete variables, $p(x)$ becomes the probability mass function $P(x)$, and integrations are
replaced by summations.

Coming back to Eq. (14.1), assume that one has at her/his disposal a number of i.i.d. samples,
$x_1, \ldots, x_N$, drawn from $p(x)$. Then the approximation

$$\mathbb{E}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) := E_{f,N} \tag{14.2}$$

is justified by (a) the law of large numbers and (b) the central limit theorem [32]. Let us denote
$\mathbb{E}[f(x)] = \mu_f$ and the respective variance as $\mathrm{var}[f(x)] := \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \sigma_f^2$. Then the
previously referred two theorems guarantee that

$$\lim_{N \to \infty} E_{f,N} = \mu_f \tag{14.3}$$

and that, approximately for large $N$,

$$E_{f,N} \sim \mathcal{N}\left(\mu_f, \frac{\sigma_f^2}{N}\right). \tag{14.4}$$

The limit in Eq. (14.3) refers to the notion of almost sure convergence, that is,

$$\mathrm{Prob}\left\{ \lim_{N \to \infty} \left|\mu_f - E_{f,N}\right| = 0 \right\} = 1.$$

The approximate Gaussian distribution in Eq. (14.4) guarantees that the variance (as one changes the
set of $N$ samples) of the obtained estimate $E_{f,N}$ around the true value $\mu_f$ decreases with $N$.

Thus, if one generates the samples $x_n$, $n = 1, 2, \ldots, N$, from the distribution $p(x)$, the use of
Monte Carlo techniques offers the means for an approximation of the integral in Eq. (14.1) with the
following nice properties: (a) the approximation error decreases as $1/\sqrt{N}$; (b) the obtained estimate
using $N$ samples is an unbiased estimate of the true value; and (c) the convergence rate is independent
of the dimensionality $l$. The latter property is in contrast to methods based on deterministic numerical
integration, which, in general, have a rate of convergence that slows down as the dimensionality
increases. In Monte Carlo techniques, if one is not satisfied with the obtained accuracy, all he/she has
to do is generate more samples.
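The sample-mean approximation of Eq. (14.2) can be sketched in a few lines. The example below is only an illustration under assumed choices: $f(x) = x^2$ and $p(x)$ a zero mean, unit variance Gaussian, so that the true value is 1. The helper name `mc_estimate` is hypothetical, and the standard error it reports is the usual plug-in estimate of $\sigma_f/\sqrt{N}$.

```python
import math
import random

def mc_estimate(f, draw, n):
    """Sample mean of f over n i.i.d. draws, plus its estimated standard error."""
    values = [f(draw()) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased sample variance
    return mean, math.sqrt(var / n)                        # error shrinks as 1/sqrt(n)

rng = random.Random(1)
# E[x^2] for x ~ N(0, 1); the exact value is 1.
est, stderr = mc_estimate(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
print(est, stderr)
```

Quadrupling `n` should roughly halve the reported standard error, whatever the dimensionality of the variable being sampled.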

The crucial point now becomes that of developing techniques to generate i.i.d. samples from p(x). 
This is not an easy task, especially for high-dimensional spaces. Note that achieving a certain accuracy 
for the estimator in Eq. (14.2) is independent of the dimensionality, once i.i.d. samples drawn from 
$p(x)$ are available. On the other hand, drawing i.i.d. samples typical of $p(x)$ becomes harder as the
dimensionality increases. We will return to this point soon. In the sequel, we will focus on some basic 
directions to achieve the aforementioned goal. 


14.2.1 RANDOM NUMBER GENERATION 

Random number generation can be achieved either as the result of an experiment or via the use of
computers. For example, the tosses of a fair coin can generate a random sequence of 0s (heads) or 1s
(tails). Another example is the sequence of numbers corresponding to the distance between radioactive
emissions; such an experiment generates a sequence of exponentially distributed samples. However,
such approaches are not of much practical value and the emphasis has been on techniques that generate
samples via a computer, using a pseudorandom number generator. At the heart of such methods lie
algorithms that guarantee the generation of a sequence of integers, $z_i$, which approximately follow a
uniform distribution in an interval in the real axis. In the sequel, the generation of random
numbers/vectors, which follow an arbitrary distribution, is obtained indirectly via a variety of methods,
each with its pros and cons. The path for generating integers in an interval $(0, M)$ follows the general
recursion,

$$z_i = g(z_{i-1}, \ldots, z_{i-m}) \bmod M,$$

where $g$ is a function depending on the $m$ previously generated samples and mod denotes the modulus
operation; that is, $z_i$ is the remainder of the division of $g(z_{i-1}, \ldots, z_{i-m})$ by $M$. The simplest form is
the linear version,

$$z_i = a z_{i-1} \bmod M, \quad z_0 = 1, \quad i \geq 1, \tag{14.5}$$

where $M$ is a large prime number and $a$ is an integer. Recursion (14.5) generates a sequence of
numbers between 1 and $M - 1$. The method is known as the linear congruential generator or Lehmer’s
algorithm [20].

If $a$ is properly chosen, then the resulting sequence of numbers turns out to be periodic with period
$M - 1$. This is the reason we call these generators pseudorandom, because a periodic sequence can
never be claimed to be random. However, for large values of $M$, the obtained sequence can be
sufficiently random with uniform distribution, provided, of course, that $N < M - 1$. For example, a value
of $M$ of the order of $10^9$ is sufficient for most applications. Note that not all possible choices of the
parameter $a$ guarantee a good generator. In practice, a sequence is accepted as being random only if it
meets a number of related tests of randomness and is subsequently used successfully in a variety of
applications (see, e.g., [32]). A common choice of parameters that leads to a reasonably good uniformly
distributed random sequence is $a = 7^5$ and $M = 2^{31} - 1$ (see, e.g., [34]). More on this topic can be
found in Knuth’s classical text and the references therein [18]. Once a sequence of integers is
available, a sequence of uniformly distributed real random numbers is obtained as the ratio
$x_i = z_i/M \in (0, 1)$ (as a matter of fact, this is the sequence on which the randomness tests are applied).
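As a minimal sketch, recursion (14.5) with the parameters quoted above, $a = 7^5$ and $M = 2^{31} - 1$, and the division by $M$ that maps the integers into $(0, 1)$, can be written as follows. The function name `lehmer_uniform` is an assumption made for illustration; the code is meant to show the mechanics and is not a recommendation to replace a library generator.

```python
def lehmer_uniform(n, a=7**5, M=2**31 - 1, z0=1):
    """Return n pseudorandom numbers in (0, 1) from z_i = a * z_{i-1} mod M."""
    z = z0
    samples = []
    for _ in range(n):
        z = (a * z) % M        # integer recursion; z stays in {1, ..., M - 1}
        samples.append(z / M)  # ratio z_i / M lies strictly inside (0, 1)
    return samples

u = lehmer_uniform(10_000)
print(u[:3], sum(u) / len(u))
```

Because $M$ is prime and $z_0 \neq 0$, the state never reaches zero, so every output lies strictly between 0 and 1.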

Remarks 14.1. 


• Note that even the generation of a sequence of (pseudo)random numbers with uniform distribution
in (0, 1) is not an easy task, in spite of the fact that the uniform distribution is an “easy” one; that is,
all values are equally probable. Moreover, often in practice, a PDF is known up to its normalizing
constant, i.e.,

$$p(x) = \frac{\phi(x)}{Z},$$

where

$$Z = \int \phi(x)\,dx.$$

However, if $\phi(x)$ has a complicated form, the previous integration may be intractable. This is
often met when computing posterior PDFs. The previous points make the process of sampling from
a general $p(x)$ much harder than for the case of a uniform one. The task becomes even harder in
high dimensions, even if $Z$ is available. The required number of points, in order to cover
sufficiently a region in a high-dimensional space, exhibits an exponential dependence on the respective
dimensionality (curse of dimensionality). Thus, a huge number of points is needed in order to get
a good representation of $p(x)$ in high-dimensional spaces. In practice, one would be more content
to generate samples from the regions where $p(x)$ gets relatively high values. However, the higher
the dimensionality, the more difficult the task of locating the high-probability regions. Similar
arguments hold for random variables of a discrete nature, where the number of states that the variable
can take is very large. Ideally, in order to have a representative sequence of samples, all states have
to be visited.






14.3 RANDOM SAMPLING BASED ON FUNCTION TRANSFORMATION 

In this section, we deal with some of the most basic techniques for drawing samples from a PDF, p(x). 


Function inversion. Let $x$ be a real random variable with a PDF, $p(x)$, and a corresponding cumulative
distribution function

$$F_x(x) = \int_{-\infty}^{x} p(\tau)\,d\tau.$$

It is known from probability theory that the random variable, $u$, defined as

$$u := F_x(x), \tag{14.6}$$

is uniformly distributed in the interval $0 \leq u \leq 1$ irrespective of the nature of $p(x)$ (see [32] and
Problem 14.1). If, in addition, we assume that the function $F_x$ has an inverse, $F_x^{-1}$, then we can write

$$x = F_x^{-1}(u). \tag{14.7}$$

Thus, following the reverse arguments, samples from $p(x)$ can be generated by first generating
samples from the uniform distribution, $\mathcal{U}(u|0, 1)$, and then applying on them the inverse function, $F_x^{-1}$
(Problem 14.2).

This method works well provided that $F_x$ has an inverse that can be easily computed. However,
only a few PDFs can be “proud” of having inverses that can be expressed in an analytical form.

Example 14.1. Generate samples, $x_n$, that follow the exponential distribution,

$$p(x) = \lambda \exp(-\lambda x), \quad x \geq 0, \ \lambda > 0, \tag{14.8}$$

using a pseudorandom generator that generates samples, $u_n$, from the uniform distribution $\mathcal{U}(u|0, 1)$.
We have

$$F_x(x) = \int_{0}^{x} \lambda \exp(-\lambda\tau)\,d\tau = 1 - \exp(-\lambda x).$$

By letting

$$u := F_x(x)$$

and solving for $x$, we get

$$x = -\frac{1}{\lambda}\ln(1 - u) := F_x^{-1}(u).$$

Hence, if $u_n$ are samples drawn from a uniform distribution, the sequence

$$x_n = -\frac{1}{\lambda}\ln(1 - u_n), \quad n = 1, 2, \ldots, N,$$

are samples drawn from the exponential PDF in Eq. (14.8). Fig. 14.1 shows the histogram of the
generated samples for $N = 1000$, alongside $p(x)$ for $\lambda = 1$.
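The inversion recipe of Example 14.1 translates almost line by line into code. The sketch below is illustrative only: the function name `sample_exponential`, the seed, and the choice $\lambda = 2$ are assumptions, and Python's built-in generator stands in for the uniform source $\mathcal{U}(u|0, 1)$.

```python
import math
import random

def sample_exponential(lam, n, rng):
    """Draw n samples from p(x) = lam * exp(-lam * x) via x = -ln(1 - u) / lam."""
    # rng.random() returns u in [0, 1), so 1 - u is in (0, 1] and the log is finite.
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

rng = random.Random(2)
lam = 2.0
x = sample_exponential(lam, 100_000, rng)
print(sum(x) / len(x))  # should be close to the true mean 1 / lam
```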
FIGURE 14.1

The histogram of the samples obtained by drawing from the uniform distribution and applying the inverse of $F_x$,
which corresponds to the exponential PDF, whose curve is shown in black. The length of the bin interval, $\Delta x$,
was chosen equal to 0.02.


Example 14.2. Generating samples from discrete distributions. Here, an intuitive method for generating
samples from discrete distributions is presented. We will use such distributions in Section 17.2.

Let $x_1, x_2, \ldots, x_K$ denote discrete random events occurring with probabilities $P_1, P_2, \ldots, P_K$,
respectively, such that $\sum_{k=1}^{K} P_k = 1$. Then the following simple algorithm draws samples from this
distribution.

Algorithm 14.1 (Sampling discrete distributions).

• Define $a_k = \sum_{i=1}^{k-1} P_i$, $b_k = \sum_{i=1}^{k} P_i$, $k = 1, 2, \ldots, K$, $a_1 = 0$.

• For $i = 1, 2, \ldots$, Do

- $u \sim \mathcal{U}(0, 1)$

- Select

* $x_k$ if $u \in [a_k, b_k)$, $k = 1, 2, \ldots, K$

• End For

Fig. 14.2 provides an illustration of the algorithm. Note that a probability jump occurs at the
beginning of each interval, which is how the corresponding cumulative distribution function (CDF) is
constructed; the algorithm basically computes the inverse of $u$, according to this CDF (see, e.g., [4]).
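Algorithm 14.1 can be sketched as follows. The right edges $b_k$ are cumulative sums, and a binary search locates the interval $[a_k, b_k)$ that contains $u$. The function name `sample_discrete`, the three-event distribution, and the use of binary search (instead of a linear scan over the intervals) are choices made here for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_discrete(events, probs, n, rng):
    """Draw n events, where events[k] is selected with probability probs[k]."""
    b = list(accumulate(probs))          # right edges b_k of the intervals [a_k, b_k)
    draws = []
    for _ in range(n):
        u = rng.random()                 # u ~ U(0, 1)
        k = bisect_right(b, u)           # first index with u < b_k, i.e., u in [a_k, b_k)
        k = min(k, len(events) - 1)      # guard against round-off in the last edge
        draws.append(events[k])
    return draws

rng = random.Random(3)
draws = sample_discrete(["x1", "x2", "x3"], [0.2, 0.5, 0.3], 50_000, rng)
freq = {e: draws.count(e) / len(draws) for e in ("x1", "x2", "x3")}
print(freq)
```

The empirical frequencies should approach the prescribed probabilities as the number of draws grows.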

FIGURE 14.2

The CDF for a discrete distribution of $K$ discrete random events. If $P_1 + P_2 \leq u < P_1 + P_2 + P_3$, the event $x_3$ is
drawn. Note that the higher the probability of an event is, the larger the corresponding interval jump in the CDF is,
hence the higher the probability of this event being drawn.

Function transformation. We will demonstrate the method via an example involving the transformation
of two random variables, say, $r$ and $\phi$, to two new ones, $x$ and $y$. Let

$$x = g_x(r, \phi)$$

and

$$y = g_y(r, \phi).$$

Let us now assume that there is a unique solution for the inverses and that they can be expressed in an
analytic form (which is not the case in general), that is,

$$r = g_r(x, y),$$

$$\phi = g_\phi(x, y).$$

We know from Section 2.2.5 that if $p_{r,\phi}(r, \phi)$ is the joint distribution of $r$ and $\phi$, then the joint
distribution of $x$ and $y$ is given by

$$p_{x,y}(x, y) = \frac{p_{r,\phi}\left(g_r(x, y), g_\phi(x, y)\right)}{\left|\det\left(J(x, y; r, \phi)\right)\right|}
= p_{r,\phi}\left(g_r(x, y), g_\phi(x, y)\right) \left|\det\left(J(r, \phi; x, y)\right)\right|, \tag{14.9}$$

where $\left|\det\left(J(x, y; r, \phi)\right)\right|$ is the absolute value of the determinant of the Jacobian matrix,

$$J(x, y; r, \phi) := \begin{bmatrix} \dfrac{\partial g_x}{\partial r} & \dfrac{\partial g_x}{\partial \phi} \\[2mm] \dfrac{\partial g_y}{\partial r} & \dfrac{\partial g_y}{\partial \phi} \end{bmatrix}, \tag{14.10}$$

$J(r, \phi; x, y)$ is analogously defined, and we have assumed, for simplicity, that to each value of $(r, \phi)$
there corresponds one value of $(x, y)$. Let us now see how one can generate samples from a Gaussian
$p(x) = \mathcal{N}(x|0, 1)$ by using samples drawn from a uniform and an exponential distribution, respectively,
for $\phi$ and $r$; recall that in Example 14.1, we described a technique for generating samples from an
exponential distribution.

The Box-Muller method. Let $r$ be distributed according to an exponential distribution,

$$p_r(r) = \frac{1}{2}\exp\left(-\frac{r}{2}\right), \quad r \geq 0, \tag{14.11}$$

and $\phi$ according to a uniform distribution, $\mathcal{U}(\phi|0, 2\pi)$,

$$p_\phi(\phi) = \begin{cases} \dfrac{1}{2\pi}, & 0 \leq \phi \leq 2\pi, \\ 0, & \text{otherwise}, \end{cases} \tag{14.12}$$

and also assume that they are independent, that is,

$$p_{r,\phi}(r, \phi) = p_r(r)\,p_\phi(\phi). \tag{14.13}$$

Generate two new random variables as

$$x = \sqrt{r}\cos\phi, \tag{14.14}$$

$$y = \sqrt{r}\sin\phi. \tag{14.15}$$

The physical interpretation of the previous transformation is that $x$, $y$ correspond to the Cartesian
coordinates of a point and $\sqrt{r}$, $\phi$ are its polar ones (Fig. 14.3). From Eqs. (14.14) and (14.15), we can
write

$$r = x^2 + y^2, \tag{14.16}$$

$$\phi = \arctan\frac{y}{x}. \tag{14.17}$$

Adjusting Eq. (14.9) to our current needs, using Eqs. (14.11)-(14.13), we obtain

$$p_{x,y}(x, y) = \frac{1}{2\pi}\cdot\frac{1}{2}\exp\left(-\frac{x^2 + y^2}{2}\right)\cdot 2
= \left(\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)\right) \left(\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)\right), \tag{14.18}$$

where we have used that (Problem 14.3)

$$\left|\det\left(J(x, y; r, \phi)\right)\right| = \frac{1}{2}.$$

Thus, we have shown that using the transformation given in Eqs. (14.14) and (14.15), we can generate
samples from normalized Gaussians.

FIGURE 14.3

(A) Relation of the Cartesian $(x, y)$ to the polar coordinates $(r, \phi)$. (B, C) If $r$ and $\phi$ are random variables following
an exponential and a uniform in $[0, 2\pi]$ distribution, respectively, then $x$ and $y$ are independent and they both follow
a normalized Gaussian, as shown in (D) for the $x$-variable.

Once samples from a normalized Gaussian, $\mathcal{N}(x|0, 1)$, are available, samples from a general Gaussian, $\mathcal{N}(y|\mu, \sigma^2)$, are obtained via the obvious transformation

$$y = \sigma x + \mu. \tag{14.19}$$

The previous approach is also generalized to random vectors in $\mathbb{R}^l$. One can first draw samples from $\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{x}|\boldsymbol{0}, I)$ by stacking together $l$ i.i.d. samples drawn from a normalized Gaussian, $\mathcal{N}(x|0, 1)$, and then apply the transformation

$$\boldsymbol{y} = L\boldsymbol{x} + \boldsymbol{\mu},$$

which is equivalent to drawing samples from

$$\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{y}|\boldsymbol{\mu}, \Sigma), \tag{14.20}$$

where $\Sigma = LL^T$ (Cholesky factorization, Problem 14.4).

Example 14.3. Generate $N = 100$ samples, $r_n$, $n = 1, 2, \ldots, 100$, from the exponential distribution in Eq. (14.11) (following Example 14.1) and $N = 100$ samples, $\phi_n$, $n = 1, 2, \ldots, 100$, from the uniform in Eq. (14.12). Then use the transformations in Eqs. (14.14), (14.15), and (14.19) to obtain samples, $x_n$, $n = 1, 2, \ldots, 100$, from $p(x) = \mathcal{N}(x|1, 0.5)$. The histogram of the obtained samples is shown in Fig. 14.4.
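The steps of the Box-Muller method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `box_muller` and its arguments are hypothetical, the exponential samples are produced by inverting the CDF $1 - \exp(-r/2)$ (assumed here to be the kind of technique Example 14.1 refers to), and each pair $(r, \phi)$ is used to produce two Gaussian samples.

```python
import math
import random


def box_muller(n_pairs, mu=0.0, sigma2=1.0, rng=None):
    """Return 2 * n_pairs samples from N(mu, sigma2) via the Box-Muller method."""
    rng = rng or random.Random()
    sigma = math.sqrt(sigma2)
    samples = []
    for _ in range(n_pairs):
        # r follows the exponential density (1/2) exp(-r/2), Eq. (14.11).
        # Assumption: inverse-CDF sampling, r = -2 ln(1 - u), u ~ U[0, 1).
        r = -2.0 * math.log(1.0 - rng.random())
        # phi is uniform on [0, 2*pi), Eq. (14.12).
        phi = 2.0 * math.pi * rng.random()
        # Eqs. (14.14) and (14.15): two independent N(0, 1) samples.
        x = math.sqrt(r) * math.cos(phi)
        y = math.sqrt(r) * math.sin(phi)
        # Eq. (14.19): scale and shift to N(mu, sigma2).
        samples.append(sigma * x + mu)
        samples.append(sigma * y + mu)
    return samples


if __name__ == "__main__":
    s = box_muller(50, mu=1.0, sigma2=0.5, rng=random.Random(0))  # 100 samples
    print(sum(s) / len(s))  # sample mean; it should lie near mu = 1
```

With only 100 samples, as in Example 14.3, the sample mean and variance fluctuate noticeably around 1 and 0.5; the match improves as the number of samples grows.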


14.4 REJECTION SAMPLING 

Applying the previously reported transformation techniques relies on having the involved transform 
functions available in a convenient (analytic) form, which in general is the exception instead of the 
rule. From now on, we turn our attention to alternative methods. 

Rejection sampling (e.g., [7,37]) is conceptually a simple technique; in order to generate independent samples from a desired PDF, $p(x)$, one draws samples from another one, say, $q(x)$, that is easier to handle, and then, instead of applying a transformation, some of the points are rejected according to an appropriate criterion.

FIGURE 14.4

The histogram of $N = 100$ points generated in Example 14.3 together with the graph of $p(x) = \mathcal{N}(x|1, 0.5)$. The bin length was chosen to be equal to $\Delta x = 0.05$.

Given two random variables, $x$ and $u$, recall that the marginal, $p(x)$, is obtained by integrating the joint PDF, $p_{x,u}(x, u)$, that is,

$$p(x) = \int_{-\infty}^{+\infty} p_{x,u}(x, u)\,du. \tag{14.21}$$

Let us now consider the following identity:

$$p(x) = \int_0^{p(x)} 1\,du = \int_{-\infty}^{+\infty} \chi_{[0,p(x)]}(u)\,du, \tag{14.22}$$

where $\chi_{[0,p(x)]}(\cdot)$ is our familiar characteristic function in the interval $[0, p(x)]$, that is,

$$\chi_{[0,p(x)]}(u) = \begin{cases} 1, & 0 \le u \le p(x), \\ 0, & \text{otherwise}. \end{cases}$$

Comparing Eqs. (14.21) and (14.22), it turns out that $\chi_{[0,p(x)]}(u)$ can be interpreted as the joint PDF of the pair $(x, u)$ defined over the set

$$A = \{(x, u): x \in \mathbb{R},\ 0 \le u \le p(x)\}. \tag{14.23}$$

Looking more carefully at $p_{x,u}(x, u) = \chi_{[0,p(x)]}(u)$, it does not take long to realize that this is the uniform density under the area of the graph $u = p(x)$, as seen in Fig. 14.5A. In other words, if one fills in the shaded area in Fig. 14.5A uniformly at random with points $(x, u)$ and then neglects the $u$ dimension, then the obtained points are samples drawn from $p(x)$. We can now go one step further and assume that $p(x)$ is not exactly known, that is,

$$p(x) = \frac{1}{Z}\phi(x),$$

and that the normalizing constant is not available (as we know, often, the computation of the normalizing constant is not easy). Then we have

$$p(x) = \frac{1}{Z}\phi(x) = \frac{1}{Z}\int_0^{\phi(x)} du = \frac{1}{Z}\int_{-\infty}^{+\infty}\chi_{[0,\phi(x)]}(u)\,du,$$

and $Z$ is given by

$$Z = \int_{-\infty}^{+\infty}\phi(x)\,dx.$$

Hence,

$$Z = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\chi_{[0,\phi(x)]}(u)\,du\,dx. \tag{14.24}$$

FIGURE 14.5

(A) Filling in the shaded area uniformly at random with points $(x_n, u_n)$, after neglecting the coordinate $u_n$, is equivalent to drawing points $x_n$ from $p(x)$. (B) The proposal distribution, $cq(x)$, is everywhere larger than or equal to $p(x)$.


In other words, even if $p(x)$ is not exactly known, $p(x)$ can still be obtained, this time in terms of the uniform distribution, $\chi_{[0,\phi(x)]}(u)$, normalized appropriately. However, rescaling the uniform does not affect the marginal. It suffices to sample uniformly at random the region $A$, which now should be defined in terms of $\phi(x)$ instead of $p(x)$. What we have said so far applies also to random vectors, $\boldsymbol{x} \in \mathbb{R}^l$, by considering the extended space $(\boldsymbol{x}, u)$, and we talk about the volume under the surface $\phi(\boldsymbol{x})$ (or $p(\boldsymbol{x})$).

FIGURE 14.6

(A) If $cq(x)$ is much larger than $\phi(x)$, most of the samples are rejected (red points), which is inefficient. (B) If $cq(x)$ and $\phi(x)$ have a good match, most of the samples are retained.

We now turn our attention to see how one can fill in the volume under the surface formed by $u = \phi(x)$ (or $u = p(x)$ if it is fully available), with points uniformly at random. Let $q(x)$ be a distribution from which we know how to draw samples; we refer to it as the proposal distribution. We select a constant $c$ such that²

$$\phi(x) \le cq(x), \quad \forall x \in \mathbb{R}.$$

² If $q(x) = 0$ in an interval, then $\phi(x)$ should be zero there.

The respective geometry is shown in Fig. 14.5B. The goal is to draw points in the interval $[0, cq(x)]$ and then keep only those that lie in the region under the surface $u = \phi(x)$. The following algorithm does the job.

Algorithm 14.2 (Rejection sampling).

• For $i = 1, 2, \ldots, N$, Do

  - Draw $x_i \sim q(x)$

  - Draw $u_i \sim \mathcal{U}(0, cq(x_i))$

  - Retain the sample if
    * $u_i \le \phi(x_i)$

• End For

The probability of accepting a point, $x$, is given by

$$\text{Prob}\{u \le \phi(x)\} = \frac{1}{cq(x)}\phi(x),$$

and the total probability, over all the possible values of $x$, for accepting samples is equal to

$$\int \frac{1}{cq(x)}\phi(x)\,q(x)\,dx = \frac{1}{c}\int\phi(x)\,dx.$$

Hence, if $c$ has a large value, only a small percentage of points is finally retained. In order to have a practical algorithm, $cq(x)$ must be chosen in order to be a good fit of $\phi(x)$. Fig. 14.6A is an example of a bad choice, while Fig. 14.6B corresponds to a good example. This is a reason that rejection sampling does not scale well with dimensionality. In high dimensions, guaranteeing that $cq(x) \ge \phi(x)$ may oblige us to select a $c$ with an excessively large value (Problem 14.5).
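Algorithm 14.2 and the acceptance probability above can be illustrated with a small Python sketch. Everything specific in it is an assumption made for the illustration: the unnormalized target $\phi(x) = x(1 - x)^3$ on $[0, 1]$ (an unnormalized Beta(2, 4) density, with $Z = 0.05$ and mean $1/3$), the uniform proposal on $[0, 1]$, the constant $c = 0.11$ (the maximum of $\phi$ is about $0.1055$), and all function names.

```python
import random


def phi_beta(x):
    """Assumed unnormalized target: x (1 - x)^3 on [0, 1], zero elsewhere."""
    return x * (1.0 - x) ** 3 if 0.0 <= x <= 1.0 else 0.0


def draw_uniform(rng):
    """Draw from the assumed proposal q = U(0, 1)."""
    return rng.random()


def uniform_pdf(x):
    """Density of the assumed proposal q = U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def rejection_sample(phi, draw_q, q_pdf, c, n_draws, rng):
    """Algorithm 14.2: return the retained samples out of n_draws proposals."""
    kept = []
    for _ in range(n_draws):
        x = draw_q(rng)                      # x_i ~ q(x)
        u = rng.uniform(0.0, c * q_pdf(x))   # u_i ~ U(0, c q(x_i))
        if u <= phi(x):                      # keep points under the graph of phi
            kept.append(x)
    return kept


if __name__ == "__main__":
    rng = random.Random(1)
    kept = rejection_sample(phi_beta, draw_uniform, uniform_pdf, 0.11, 10000, rng)
    # The retained fraction should be close to (1/c) * integral of phi = 0.05 / 0.11.
    print(len(kept) / 10000)
```

Increasing `c` in this sketch lowers the retained fraction in proportion, which is the inefficiency discussed above.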

Besides the basic rejection scheme, a number of variants have also been proposed, in order to overcome the difficulty of selecting a proposal distribution that “looks like” the desired one. Adaptive rejection sampling is such a technique (see, e.g., [11] and the references therein). According to this method, the proposal distribution is adaptively constructed, based on the derivatives of $\ln p(x)$. For log-concave functions, this derivative is a nonincreasing function and can be used to construct an envelope function of $p(x)$.

Although rejection sampling is not appropriate for difficult tasks, it has still been used, sometimes in its more refined forms, to generate samples from a number of standard distributions, such as the Gaussian, gamma, and Student's t (see, e.g., [22]).


14.5 IMPORTANCE SAMPLING 

Importance sampling (IS) is a method for estimating expectations. Let $f(x)$ be a known function of a random vector variable, $x$, which is distributed according to $p(x)$. If one could draw samples from $p(x)$, then the expectation in Eq. (14.1) could be approximated as in Eq. (14.2). We will now assume that we are not able to draw samples from $p(x)$, and to go one step further, assume that $p(x)$ is only known up to a normalizing constant, that is,

$$p(x) = \frac{1}{Z}\phi(x).$$


Let $q(x)$ be another distribution from which samples can be drawn. Then we can write

$$\mathbb{E}[f(x)] = \frac{1}{Z}\int_{-\infty}^{\infty} f(x)\phi(x)\,dx = \frac{1}{Z}\int_{-\infty}^{\infty} f(x)\frac{\phi(x)}{q(x)}q(x)\,dx \simeq \frac{1}{ZN}\sum_{i=1}^{N} w(x_i)f(x_i), \tag{14.25}$$

where $x_i$, $i = 1, 2, \ldots, N$, are samples drawn from $q(x)$ and

$$w(x) := \frac{\phi(x)}{q(x)}. \tag{14.26}$$

The normalizing constant can readily be obtained as

$$Z = \int_{-\infty}^{\infty}\phi(x)\,dx = \int_{-\infty}^{\infty}\frac{\phi(x)}{q(x)}q(x)\,dx \simeq \frac{1}{N}\sum_{i=1}^{N} w(x_i). \tag{14.27}$$

Combining Eqs. (14.25) and (14.27), we finally obtain

$$\mathbb{E}[f(x)] \simeq \frac{\sum_{i=1}^{N} w(x_i)f(x_i)}{\sum_{i=1}^{N} w(x_i)}, \tag{14.28}$$

or

$$\mathbb{E}[f(x)] \simeq \sum_{i=1}^{N} W(x_i)f(x_i): \quad \text{importance sampling approximation},$$

where $W(x_i) = \frac{w(x_i)}{\sum_{j=1}^{N} w(x_j)}$ are the normalized weights. It is not difficult to show (Problem 14.6) that the estimate

$$\hat{Z} = \frac{1}{N}\sum_{i=1}^{N} w(x_i) \tag{14.29}$$

corresponds to an unbiased estimator of the normalizing constant. This is very interesting, because computing the normalizing constant is particularly useful information in a number of tasks. Recall that the evidence function, discussed in Chapter 12, is a normalizing constant; see also [26] for related comments.

FIGURE 14.7

If $q(x)$ is not a good match of $\phi(x)$, a number of undesired effects appear. Samples from region A will give rise to weights of much larger values compared to those in region B. Due to the extremely low values of $q(x)$ in C, it is highly likely that given the finite size of the number of the samples, $N$, no samples will be drawn from this region, in spite of the fact that this is the most dominant region for $\phi(x)$ ($p(x)$).

In contrast, the estimator associated with Eq. (14.28), being the result of a ratio, is unbiased only asymptotically and it is a biased one for finite values of $N$ (Problem 14.6). Hence, if one had the luxury of a very large number $N$ of samples, Eq. (14.28) would be a good enough estimate. However, in practice, $N$ cannot be made arbitrarily large and the resulting estimate may not be satisfactory.

If $q(x) = p(x)$, or at least $q(x)$ is a fairly good approximation of $\phi(x)$, then Eq. (14.28) would approximate Eq. (14.2). However, for most practical cases, this is not easy to obtain, especially in high-dimensional spaces. If $q(x)$ is not a good match to $\phi(x)$, it is very likely that there will be regions where $\phi(x)$ is large while $q(x)$ is much smaller. The corresponding weights will have large values, relative to those from other regions, and they will be the dominant ones in the summation (Eq. (14.28)). The effect of this is equivalent to reducing the number, $N$, of samples. Moreover, it is also possible that $q(x)$ takes very small values in some regions, which makes it very likely that samples from such regions are completely absent in Eq. (14.28) (see Fig. 14.7). In such cases, not only may the resulting estimate be wrong, but we will not be aware of it, and the variance of the weights, $w(x_n)$ and $w(x_n)f(x_n)$, may exhibit low values. These phenomena are accentuated in high-dimensional spaces (see, e.g., [26] and Problem 14.7).
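The estimates in Eqs. (14.28) and (14.29) can be sketched as follows. The setup is an assumption chosen so that the answers are known: the unnormalized target is $\phi(x) = \exp(-x^2/2)$ (so $Z = \sqrt{2\pi}$ and $\mathbb{E}[x^2] = 1$), the proposal is a wider Gaussian $\mathcal{N}(0, 2^2)$, and all function names are hypothetical.

```python
import math
import random

SQRT_2PI = math.sqrt(2.0 * math.pi)


def phi_gauss(x):
    """Assumed unnormalized target exp(-x^2 / 2); its true Z is sqrt(2 * pi)."""
    return math.exp(-0.5 * x * x)


def q_pdf(x, s=2.0):
    """Density of the assumed proposal N(0, s^2)."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * SQRT_2PI)


def draw_q(rng, s=2.0):
    """Draw one sample from the assumed proposal N(0, s^2)."""
    return rng.gauss(0.0, s)


def importance_sampling(f, phi, q_pdf, draw_q, n, rng):
    """Return (estimate of E[f(x)] by Eq. (14.28), estimate of Z by Eq. (14.29))."""
    xs = [draw_q(rng) for _ in range(n)]
    w = [phi(x) / q_pdf(x) for x in xs]      # weights of Eq. (14.26)
    w_sum = sum(w)
    expectation = sum(wi * f(x) for wi, x in zip(w, xs)) / w_sum
    z_hat = w_sum / n
    return expectation, z_hat


if __name__ == "__main__":
    est, z_hat = importance_sampling(lambda x: x * x, phi_gauss, q_pdf, draw_q,
                                     20000, random.Random(2))
    print(est, z_hat)  # should lie near 1 and near sqrt(2 * pi)
```

Here the proposal is wider than the target, so the weights stay bounded. Swapping the roles (a narrow proposal for a wide target) is one way to reproduce the dominant-weight effect described above.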

To alleviate the previous shortcomings, a number of variants have been proposed that search for high-probability regions, make local approximations around the modes, and use them to generate samples (see, e.g., [30] and the references therein).


14.6 MONTE CARLO METHODS AND THE EM ALGORITHM

In Section 12.4.1, the EM algorithm was introduced for maximizing the log-likelihood function when some of the variables are hidden or missing. During the E-step (Eq. (12.40)), the function $Q$ is computed, which at the $(j + 1)$th iteration step of the algorithm is written as

$$Q(\xi, \xi^{(j)}) = \mathbb{E}\left[\ln p(\mathcal{X}^l, \mathcal{X}; \xi)\right] = \int p(\mathcal{X}^l|\mathcal{X}; \xi^{(j)})\ln p(\mathcal{X}^l, \mathcal{X}; \xi)\,d\mathcal{X}^l, \tag{14.30}$$

where $\mathcal{X}^l$ is the set of hidden variables, $\mathcal{X}$ the set of observed values, and $\xi$ the unknown set of parameters. In case the computation of the integral is not tractable, Monte Carlo techniques can be mobilized to generate $L$ samples for the hidden variables, $\mathcal{X}_1^l, \ldots, \mathcal{X}_L^l$, from the posterior $p(\mathcal{X}^l|\mathcal{X}; \xi^{(j)})$ and obtain an approximation

$$\hat{Q}(\xi, \xi^{(j)}) \approx \frac{1}{L}\sum_{i=1}^{L}\ln p(\mathcal{X}_i^l, \mathcal{X}; \xi). \tag{14.31}$$

Maximization with respect to $\xi$ is now carried out via $\hat{Q}$.

A specific form of Monte Carlo EM results in the context of mixture modeling; this is known as stochastic EM. The idea is to generate a single sample from the posterior (which now refers to the labels of the mixtures) and assign corresponding observations in the respective mixtures. That is, a hard assignment takes place. The M-step is then applied based on this approximation [3].
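The stochastic EM idea can be sketched for a deliberately simple case. The model below is an assumption made for the illustration, not taken from the text: a one-dimensional mixture of two Gaussians with known unit variances, where only the two means and the mixing probabilities are estimated. The function name `stochastic_em` is hypothetical.

```python
import math
import random


def stochastic_em(data, mu_init, n_iter, rng):
    """Stochastic EM sketch for a two-component, unit-variance Gaussian mixture."""
    mu = list(mu_init)
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # Stochastic E-step: draw a single label per observation from the
        # posterior over labels, given the current parameters (hard assignment).
        groups = ([], [])
        for x in data:
            lik = [pi[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in (0, 1)]
            p0 = lik[0] / (lik[0] + lik[1])
            groups[0 if rng.random() < p0 else 1].append(x)
        # M-step on the hard assignments; an empty component is left unchanged.
        for k in (0, 1):
            if groups[k]:
                mu[k] = sum(groups[k]) / len(groups[k])
                pi[k] = len(groups[k]) / len(data)
    return mu, pi


if __name__ == "__main__":
    rng = random.Random(3)
    data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.5 else rng.gauss(2.0, 1.0)
            for _ in range(500)]
    print(stochastic_em(data, [-1.0, 1.0], 30, rng))
```

Because the labels are redrawn at every iteration, the parameter estimates keep fluctuating slightly instead of settling at a fixed point; averaging the last few iterates is one common remedy, mentioned here only as a possibility.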


14.7 MARKOV CHAIN MONTE CARLO METHODS 

As we have already discussed, a major drawback associated with rejection as well as importance sam- 
pling methods is that they cannot tackle tasks in high-dimensional spaces very well. 

In this section, we will deal with methods that scale well with the dimensionality of the sample space. Such techniques build upon arguments that come from the theory of Markov chains; we start by presenting some definitions and basics related to this important theory. Hidden Markov models, treated in Chapter 16.5, are instances of Markov chains. Here we will shed more light on such models from a different perspective.

Markov chains/processes are named after the Russian mathematician Andrey Andreyevich Markov (1856-1922), who contributed seminal papers in the field of stochastic processes. As a professor at Saint Petersburg University during the students' riots in 1908, he refused the government's order to monitor and spy on his students, and he retired from the university.

Definition 14.1. A Markov chain is a sequence of random (vector) variables, $x_0, x_1, x_2, \ldots$, with conditional distributions that obey the rule

$$p(x_n | x_{n-1}, \{x_t: t \in \mathcal{I}\}) = p(x_n | x_{n-1}), \tag{14.32}$$

where $\mathcal{I} = \{0, 1, \ldots, n - 2\}$. The index $n$ is usually interpreted as time.

In words, Eq. (14.32) says that $x_n$ is independent of the variables with indices in $\mathcal{I}$, given the values of the variables in $x_{n-1}$. The distribution $p$ can be either a density function or a probability distribution corresponding to discrete variables, taking values in a discrete set, known as states. We will assume that all variables share a common range, known as state space. Most of our discussion will evolve along finite state spaces, where states take values in a finite discrete set, say, $\{1, 2, \ldots, K\}$. A Markov chain is specified in terms of (a) the distribution (vector of probabilities), $p_0$, associated with the first vector in the sequence, $x_0$, and (b) the $K \times K$ matrices of the transition probabilities, that is,

$$P_n(x_n|x_{n-1}) = [P_n(i|j)],$$

where

$$P_n(i|j) := P_n(x_n = i | x_{n-1} = j), \quad i, j = 1, 2, \ldots, K,$$

denotes the probability that the variable at time $n$ is at state $i$, given that the variable at time $n - 1$ is at state $j$.³ Given the matrix of the transition probabilities, we write

$$p_n = P_n(x_n|x_{n-1})\,p_{n-1}, \tag{14.33}$$

where

$$\begin{aligned} p_n &:= [P(x_n = 1), P(x_n = 2), \ldots, P(x_n = K)]^T && (14.34) \\ &:= [P_n(1), \ldots, P_n(K)]^T && (14.35) \end{aligned}$$

is the vector of the respective probabilities at time $n$. The Markov chain is said to be homogeneous or stationary if the transition matrix is independent of time, that is,

$$P_n(x_n = i|x_{n-1} = j) = P(i|j) := P_{ij}, \quad i, j = 1, 2, \ldots, K,$$

and

$$P_n(x_n|x_{n-1}) = P = [P_{ij}].$$

In this case, we can write

$$p_n = Pp_{n-1} = P^2p_{n-2} = \cdots = P^np_0, \tag{14.36}$$

or equivalently,

$$P_n(i) = \sum_{j=1}^{K} P_{ij}P_{n-1}(j). \tag{14.37}$$

In the sequel, we will focus on stationary Markov chains.
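The recursion in Eqs. (14.36) and (14.37) amounts to repeated matrix-vector products, as the following sketch shows. The two-state matrix `P_EXAMPLE` is a hypothetical illustration (it does not come from the text), stored so that `P[i][j]` holds $P(i|j)$ and each column adds to one.

```python
def step(P, p):
    """One application of Eq. (14.37): p_n(i) = sum_j P[i][j] * p_{n-1}(j)."""
    K = len(p)
    return [sum(P[i][j] * p[j] for j in range(K)) for i in range(K)]


def propagate(P, p0, n):
    """Return p_n = P^n p_0 for a stationary chain, Eq. (14.36)."""
    p = list(p0)
    for _ in range(n):
        p = step(P, p)
    return p


# Hypothetical two-state chain; P_EXAMPLE[i][j] is the probability of moving to i from j.
P_EXAMPLE = [[0.9, 0.5],
             [0.1, 0.5]]

if __name__ == "__main__":
    for n in (1, 5, 60):
        print(n, propagate(P_EXAMPLE, [0.0, 1.0], n))
```

For this particular matrix the printed vectors approach $[5/6, 1/6]^T$, whichever probability vector is used as the starting point.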

³ Note that equality here, $x_n = i$, means that the (vector) variable $x_n$ is at state $i$.

Properties of the transition probabilities matrix. The transition matrix has a special structure leading to certain properties, which will be used later on.

• The matrix $P$ is a stochastic matrix. That is, all its entries are nonnegative, and the entries across each column add to one, that is,

$$\sum_{i=1}^{K} P_{ij} = 1,$$

which is a direct consequence of the definition of probabilities.

• The value $\lambda = 1$ is always an eigenvalue of $P$ (Problem 14.8). Moreover, there is no eigenvalue with magnitude larger than one (Problem 14.9).

• The eigenvectors corresponding to eigenvalues $\lambda \ne 1$ comprise components that add to zero (Problem 14.10).

• The left eigenvector corresponding to $\lambda = 1$,

$$b_1^TP = b_1^T, \tag{14.38}$$

has all its elements equal. This is easily verified by plugging in $b_1 = [1, 1, \ldots, 1]^T$ and checking that this is indeed an eigenvector.

• Invariant distribution. A distribution is said to be invariant over the states of a Markov chain if

$$p = Pp.$$

Note that $p$ is necessarily an eigenvector corresponding to the eigenvalue $\lambda = 1$. Moreover, because $p$ consists of probabilities, its elements must add to one. Depending on the multiplicity of $\lambda = 1$, there may be more than one invariant distribution. For example, if $P = I$, any probability distribution is an invariant for the respective Markov chain. It turns out that any Markov chain with a finite number of states has at least one invariant distribution. However, if the elements of $P$ are strictly positive, then there is a unique invariant distribution that coincides with the unique eigenvector corresponding to the maximum eigenvalue, $\lambda = 1$, which in this case has a multiplicity of one. Furthermore, the eigenvector corresponding to $\lambda = 1$ comprises positive elements, which after scaling can always be made to add to one. This is a by-product of the celebrated Perron-Frobenius theorem, which is ensured if $P$ has strictly positive entries⁴ (see, e.g., [32]). We will cover invariant distributions in more detail soon.

• Detailed balance condition. Let $P$ be the transition probability matrix of a stationary Markov chain. Let also $p = [P_1, \ldots, P_K]^T$ be the set of probabilities describing a discrete distribution. We say that the detailed balance condition is satisfied if

$$P(i|j)P_j = P(j|i)P_i. \tag{14.39}$$

That is, there exists a type of symmetry. If this condition holds, then the respective distribution is invariant for the Markov chain. Indeed,

$$\sum_{j=1}^{K} P(i|j)P_j = \sum_{j=1}^{K} P(j|i)P_i = P_i, \tag{14.40}$$

or

$$p = Pp. \tag{14.41}$$

Although this is not a necessary condition for distribution invariance, it is very useful in practice; it helps us construct Markov chains with a desired invariant distribution. As we will soon see, this will be the type of distributions from which we want to draw samples.

⁴ This is also true for a class of matrices with nonnegative elements, known as primitive matrices. That is, there exists an $n$ such that $P^n$ has positive elements.
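The link between Eq. (14.39) and invariance can be checked numerically. The sketch below builds a transition matrix for a chosen target distribution using one possible construction, which is an assumption of this illustration: from state $j$, another state $i$ is proposed uniformly and accepted with probability $\min(1, P_i/P_j)$. The function names and the target `[0.2, 0.3, 0.5]` are hypothetical.

```python
def metropolis_style_matrix(target):
    """Build a column-stochastic P, with P[i][j] = P(i|j), obeying Eq. (14.39)."""
    K = len(target)
    P = [[0.0] * K for _ in range(K)]
    for j in range(K):
        for i in range(K):
            if i != j:
                # Propose i uniformly among the other K - 1 states and
                # accept with probability min(1, target[i] / target[j]).
                P[i][j] = min(1.0, target[i] / target[j]) / (K - 1)
        # The remaining probability mass stays at state j.
        P[j][j] = 1.0 - sum(P[i][j] for i in range(K) if i != j)
    return P


def satisfies_detailed_balance(P, p, tol=1e-12):
    """Check P(i|j) p_j = P(j|i) p_i for all pairs, Eq. (14.39)."""
    K = len(p)
    return all(abs(P[i][j] * p[j] - P[j][i] * p[i]) <= tol
               for i in range(K) for j in range(K))


def is_invariant(P, p, tol=1e-12):
    """Check p = P p, Eq. (14.41)."""
    K = len(p)
    return all(abs(sum(P[i][j] * p[j] for j in range(K)) - p[i]) <= tol
               for i in range(K))


if __name__ == "__main__":
    target = [0.2, 0.3, 0.5]
    P = metropolis_style_matrix(target)
    print(satisfies_detailed_balance(P, target), is_invariant(P, target))
```

The converse does not hold in general: a distribution can be invariant for a chain without satisfying detailed balance, in line with the remark that the condition is sufficient but not necessary.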

14.7.1 ERGODIC MARKOV CHAINS 

We now turn our attention to a specific type of Markov chains, which are known as ergodic. Such chains have a unique invariant distribution, which can be obtained as the limit

$$\lim_{n\to\infty} p_n = \lim_{n\to\infty} P^np_0,$$

which is independent of the choice of the initial values in $p_0$. We will now focus on a class of ergodic processes and elaborate on their convergence.

Let us consider a stationary Markov chain, with a transition matrix $P$, with eigenvalues $1 = \lambda_1 > |\lambda_2| \ge \cdots \ge |\lambda_K|$. That is, only one eigenvalue has the maximum value and the rest have magnitude strictly less than one. Moreover, we assume that one can find a complete set of linearly independent eigenvectors. Such assumptions are not restrictive and hold true for a wide class of stochastic matrices. Then we can write

$$P = A\Lambda A^{-1}, \tag{14.42}$$

where $\Lambda$ is the diagonal matrix $\Lambda = \text{diag}\{1, \lambda_2, \ldots, \lambda_K\}$ and $A$ has as columns the respective eigenvectors. Hence, from Eq. (14.36) we get

$$p_n = A\Lambda^nA^{-1}p_0 = A\begin{bmatrix} 1 & & & O \\ & \lambda_2^n & & \\ & & \ddots & \\ O & & & \lambda_K^n \end{bmatrix}A^{-1}p_0,$$

with $\lambda_k^n \to 0$, $k = 2, \ldots, K$. Hence,

$$p_\infty := \lim_{n\to\infty} p_n = P_\infty p_0, \tag{14.43}$$

where

$$P_\infty = A\begin{bmatrix} 1 & & & O \\ & 0 & & \\ & & \ddots & \\ O & & & 0 \end{bmatrix}A^{-1} = a_1b_1^T, \tag{14.44}$$

with $a_1$ being the first eigenvector (first column of $A$), corresponding to $\lambda_1 = 1$, and $b_1^T$ the first row of $A^{-1}$. In other words, $P_\infty$ is a rank one matrix. However, it is straightforward to see from Eq. (14.42) ($A^{-1}P = \Lambda A^{-1}$) that $b_1$ is a left eigenvector of $P$, that is,

$$b_1^TP = b_1^T,$$

and recalling the properties (Eq. (14.38) and the comments just after it) of $P$, $b_1^T = [1, 1, \ldots, 1]$ (within a proportionality constant $c$). Thus,

$$P_\infty = [a_1, \ldots, a_1],$$

and from Eq. (14.43), because the components of $p_0$ add to one, we finally obtain

$$p_\infty = a_1.$$

That is, the limiting distribution is equal (after scaling) to the unique eigenvector of $P$ corresponding to $\lambda_1 = 1$; moreover, this is true irrespective of the values of $p_0$. In other words, the limiting distribution is the invariant distribution of $P$, that is,

$$Pp_\infty = p_\infty. \tag{14.45}$$

Note that the convergence rate is controlled by the magnitude of $|\lambda_2|$. Other, more theoretically refined, convergence results and bounds can be found in, for example, [24,39,40].

Remarks 14.2. 

• Needless to say, not all Markov chains are ergodic. For example, if the eigenvalue $\lambda_1 = 1$ of the transition matrix has multiplicity higher than one, then the limiting distribution depends on the values of the initial choice in $p_0$. On the other hand, if the transition matrix has more than one eigenvalue with magnitude equal to one (e.g., $\lambda_1 = 1$, $\lambda_2 = -1$) then, again, it does not have a limiting distribution but instead exhibits a periodic limit cycle (e.g., [32]).

• Building Markov chains. In practice, one can construct transition probability matrices for ergodic chains using a set of simpler transition matrices, $B_1, B_2, \ldots, B_M$, which are known as base transition matrices. Each one of them may not be ergodic, but it is required to accept the desired distribution as its invariant. Then the transition matrix is built as

$$P = \sum_{m=1}^{M}\alpha_mB_m, \quad \alpha_m > 0, \quad \sum_{m=1}^{M}\alpha_m = 1.$$

If a distribution is invariant with respect to each $B_m$, $m = 1, 2, \ldots, M$, it will be invariant for $P$. The same applies with the detailed balance condition.

Another way is to combine individual transition matrices sequentially, that is,

$$P = B_1B_2\cdots B_M.$$

For example, each $B_m$, $m = 1, 2, \ldots, M$, may act and change a subset of the random entries comprising the random vector $x$. We will see that this is the case with the Gibbs sampling method, to be reviewed soon. It is easy to see that if $p$ is invariant for each individual $B_m$, $m = 1, 2, \ldots, M$, it will also be invariant for $P$.

• In this section, we focused our discussion on Markov chains with finite state space. Everything we have said can be generalized to Markov chains with countably infinite or continuous state spaces. In the latter case, the place of the probability transition matrix is taken by the transition density or kernel, $p(x_n|x_{n-1})$, and the probability density of $x_n$, at time $n$, is given by

$$p_n(x) = \int p(x|y)p_{n-1}(y)\,dy. \tag{14.46}$$

The analysis in this case is more difficult and care has to be taken because not all results obtained for the finite discrete case are readily valid for the continuous one. The reason we focused on the discrete finite state space is that one can get the feeling of the theory of Markov chains by spending less “budget” on the required mathematical effort.
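The two ways of combining base transition matrices mentioned in the remarks above can be illustrated with a small sketch. The matrices `B1` and `B2` and the distribution `TARGET` are hypothetical: `B1` only moves probability between the first two states and `B2` only between the last two, so neither is ergodic on its own, yet both leave `TARGET` invariant.

```python
def matvec(P, p):
    """Return the product P p for a square matrix stored as a list of rows."""
    K = len(p)
    return [sum(P[i][j] * p[j] for j in range(K)) for i in range(K)]


def matmul(A, B):
    """Return the product A B of two square matrices."""
    K = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(K)) for j in range(K)]
            for i in range(K)]


def mix(mats, alphas):
    """Return sum_m alpha_m B_m, with alpha_m > 0 adding to one."""
    K = len(mats[0])
    return [[sum(a * B[i][j] for a, B in zip(alphas, mats)) for j in range(K)]
            for i in range(K)]


def apply_n(P, p, n):
    """Return P^n p by repeated matrix-vector products."""
    for _ in range(n):
        p = matvec(P, p)
    return p


TARGET = [0.2, 0.3, 0.5]
B1 = [[0.4, 0.4, 0.0],   # acts on states 1 and 2 only
      [0.6, 0.6, 0.0],
      [0.0, 0.0, 1.0]]
B2 = [[1.0, 0.0, 0.0],   # acts on states 2 and 3 only
      [0.0, 0.5, 0.3],
      [0.0, 0.5, 0.7]]

if __name__ == "__main__":
    P_mix = mix([B1, B2], [0.5, 0.5])
    P_seq = matmul(B1, B2)
    print(matvec(P_mix, TARGET), matvec(P_seq, TARGET))
    print(apply_n(B1, [0.0, 0.0, 1.0], 50))     # B1 alone never leaves state 3
    print(apply_n(P_mix, [0.0, 0.0, 1.0], 300))  # the mixture approaches TARGET
```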

Example 14.4. Consider the Markov chain with the transition probability matrix

$$P = \begin{bmatrix} 0.2 & 0.4 & 0.6 \\ 0.5 & 0.1 & 0.3 \\ 0.3 & 0.5 & 0.1 \end{bmatrix}.$$

Its eigenvalues are $\lambda_1 = 1$, $\lambda_2 = -0.3 + 0.1732j$, and $\lambda_3 = -0.3 - 0.1732j$, and the respective eigenvectors are

$$a_1 = [0.6608, 0.5406, 0.5206]^T,$$

$$a_2 = [0.5774, -0.2887 - 0.5j, -0.2887 + 0.5j]^T,$$

$$a_3 = [0.5774, -0.2887 + 0.5j, -0.2887 - 0.5j]^T.$$

Observe that all the elements of the eigenvector corresponding to $\lambda_1 = 1$ are positive. Also, the elements of the other two eigenvectors add to zero.

We can now write

$$P = \begin{bmatrix} 0.6608 & 0.5774 & 0.5774 \\ 0.5406 & -0.2887 - 0.5j & -0.2887 + 0.5j \\ 0.5206 & -0.2887 + 0.5j & -0.2887 - 0.5j \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & -0.3 + 0.1732j & 0 \\ 0 & 0 & -0.3 - 0.1732j \end{bmatrix} \times \begin{bmatrix} 0.5807 & 0.5807 & 0.5807 \\ 0.5337 - 0.0058j & -0.3323 + 0.4942j & -0.3323 - 0.5058j \\ 0.5337 + 0.0058j & -0.3323 - 0.4942j & -0.3323 + 0.5058j \end{bmatrix}.$$

Observe that the first row of the last matrix ($A^{-1}$) has all its elements equal. Having written $P$ in a product form, the following is easy to obtain:

$$P^2 = \begin{bmatrix} 0.42 & 0.42 & 0.30 \\ 0.24 & 0.36 & 0.36 \\ 0.34 & 0.22 & 0.34 \end{bmatrix}, \quad P^{10} = \begin{bmatrix} 0.3837 & 0.3837 & 0.3837 \\ 0.3140 & 0.3140 & 0.3139 \\ 0.3023 & 0.3023 & 0.3023 \end{bmatrix}.$$

The sequence has converged for $n = 10$. Note that after convergence, $P^n$ has all its column vectors equal, and the elements add to one. Moreover, observe that

$$P_\infty \propto [a_1, a_1, a_1].$$
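The powers quoted in Example 14.4 can be reproduced with plain repeated multiplication, without any eigendecomposition. The helper names `matmul` and `matpow` in this sketch are hypothetical.

```python
def matmul(A, B):
    """Return the product A B of two square matrices stored as lists of rows."""
    K = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(K)) for j in range(K)]
            for i in range(K)]


def matpow(P, n):
    """Return P^n by repeated multiplication (n >= 0)."""
    K = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    for _ in range(n):
        result = matmul(result, P)
    return result


# Transition matrix of Example 14.4; entry [i][j] is P(i|j).
P_EX = [[0.2, 0.4, 0.6],
        [0.5, 0.1, 0.3],
        [0.3, 0.5, 0.1]]

if __name__ == "__main__":
    for row in matpow(P_EX, 2):
        print([round(v, 4) for v in row])
    for row in matpow(P_EX, 10):
        print([round(v, 4) for v in row])
```

Scaling $a_1$ so that its elements add to one gives approximately $[0.3837, 0.3140, 0.3023]^T$, which is what every column of $P^{10}$ approaches.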


Example 14.5. Random walk with finite states. Random walks are popular models that can model faithfully a number of real-world phenomena, such as thermal noise, the motion of gas molecules, and stock value variations. Moreover, such chains can help us understand the behavior of more complex Markov chains, to be discussed soon. There are various random walk models, depending on the choice of the transition probabilities (see, e.g., [32]). Here, we assume the variables to be discrete and take integer values in a finite set, $[0, N]$. Hence, the total number of states is $N + 1$. At every time instant, the value of the variable can either increase or decrease by one, each with probability $p$, or stay unchanged, with probability $q$, provided that the current state is in the interval $[1, N - 1]$. That is, if $0 < x_{n-1} < N$,

$$P(x_n = x_{n-1} + 1) = P(x_n = x_{n-1} - 1) = p,$$

$$P(x_n = x_{n-1}) = q.$$

If $x_{n-1} = 0$, then $x_n$ can either stay in the same state with probability $q_e$ or increase by one with probability $p$. If $x_{n-1} = N$, then $x_n$ can either stay in the same state with probability $q_e$ or decrease by one with probability $p$. Obviously,

$$2p + q = 1, \quad p + q_e = 1.$$

The transition probability matrix, for the case of $N = 4$, $p = \frac{1}{4}$, $q = \frac{1}{2}$, and $q_e = \frac{3}{4}$, is

$$P = \begin{bmatrix} 3/4 & 1/4 & 0 & 0 & 0 \\ 1/4 & 1/2 & 1/4 & 0 & 0 \\ 0 & 1/4 & 1/2 & 1/4 & 0 \\ 0 & 0 & 1/4 & 1/2 & 1/4 \\ 0 & 0 & 0 & 1/4 & 3/4 \end{bmatrix}.$$

The respective eigenvalues are $\lambda_1 = 1$, $\lambda_2 = 0.904$, $\lambda_3 = 0.654$, $\lambda_4 = 0.345$, and $\lambda_5 = 0.095$. Observe that all eigenvalues, except $\lambda_1 = 1$, have magnitude less than one. The corresponding eigenvectors are

$$a_1 = [0.447, 0.447, 0.447, 0.447, 0.447]^T,$$

$$a_2 = [-0.601, -0.371, 0, 0.371, 0.601]^T,$$

$$a_3 = [-0.511, 0.195, 0.632, 0.195, -0.511]^T,$$

$$a_4 = [-0.371, 0.601, 0, -0.601, 0.371]^T,$$

$$a_5 = [0.195, -0.511, 0.632, -0.511, 0.195]^T.$$

The eigenvector corresponding to $\lambda_1$ has all its components equal and positive. Hence, the invariant distribution ($p$: $Pp = p$), after the required scaling, becomes the uniform one, $p = [1/5, 1/5, 1/5, 1/5, 1/5]^T$. Similar arguments apply for any value of $N$. Observe that the components of all the other eigenvectors add to zero.

Fig. 14.8 shows the probability distribution $p_n$ for the case of $N = 4$, at times $n = 10$, $50$, and $100$. The components of $p_0$ were randomly chosen. Fig. 14.9 corresponds to the case of $N = 9$. Observe that the larger the value of $N$, the slower the convergence.
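The dependence of the convergence speed on $N$ can be checked with a short sketch. It builds the transition matrix of Example 14.5 for any $N$ and measures how far $p_n$ is from the uniform distribution. As an assumption made for reproducibility, the chain is started deterministically at state 0 rather than from a randomly chosen $p_0$; the function names are hypothetical.

```python
def random_walk_matrix(N, p):
    """Transition matrix of Example 14.5 on states 0..N, with q = 1 - 2p, q_e = 1 - p."""
    q, q_e = 1.0 - 2.0 * p, 1.0 - p
    K = N + 1
    P = [[0.0] * K for _ in range(K)]
    for j in range(K):                  # j is the current state (column index)
        P[j][j] = q_e if j in (0, N) else q
        if j > 0:
            P[j - 1][j] = p             # decrease by one
        if j < N:
            P[j + 1][j] = p             # increase by one
    return P


def propagate(P, p0, n):
    """Return p_n = P^n p_0 by repeated matrix-vector products."""
    p = list(p0)
    K = len(p)
    for _ in range(n):
        p = [sum(P[i][j] * p[j] for j in range(K)) for i in range(K)]
    return p


def distance_from_uniform(p):
    """Largest absolute deviation of p from the uniform distribution."""
    return max(abs(v - 1.0 / len(p)) for v in p)


if __name__ == "__main__":
    for N in (4, 9):
        P = random_walk_matrix(N, 0.25)
        start = [1.0] + [0.0] * N       # assumption: start at state 0
        for n in (10, 50, 100):
            print(N, n, distance_from_uniform(propagate(P, start, n)))
```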

Example 14.6. In this example, we consider a random walk with (countably) infinitely many states; at every time instant, the value of the random variable can either increase or decrease by one, with probability $p$, or stay in the same state with probability $q$, that is,

$$P(x_n = x_{n-1} + 1) = P(x_n = x_{n-1} - 1) = p,$$

$$P(x_n = x_{n-1}) = q,$$

and

$$2p + q = 1.$$

The difference with the previous example is that now there are no “barrier” points and the random variable can take any integer value. Our goal is to compute the mean and variance as functions of time $n$, when the starting point is deterministically chosen to be $x_0 = 0$.

FIGURE 14.8

The probability distribution for the random walk chain of Example 14.5 for $N = 4$ and at time instants (A) $n = 10$, (B) $n = 50$, and (C) $n = 100$.

FIGURE 14.9

The probability distribution for the random walk chain of Example 14.5 for $N = 9$ and at time instants (A) $n = 10$, (B) $n = 50$, and (C) $n = 100$. Compared to Fig. 14.8, observe that the higher the number $N$, the slower the convergence.

It is readily seen that $\mathbb{E}[x_n] = 0$, because the variable is equally likely to increase or decrease and hence it is equally likely to assume any positive or negative value.

For the variance, we obtain (Problem 14.11)

$$\mathbb{E}[x_n^2] = \mathbb{E}[x_{n-1}^2] + 2p = 2pn + \mathbb{E}[x_0^2] = 2pn, \tag{14.47}$$

because $\mathbb{E}[x_0^2] = 0$. Note that the variance tends to infinity with time, hence the infinite state-space random walk does not have a limiting distribution. This verifies what we said before; results that hold true for finite state spaces do not necessarily carry over to the case where the number of states becomes infinite.
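Eq. (14.47) can be verified without simulation by propagating the exact distribution of $x_n$ over the integers. The sketch below is one way to do this, with hypothetical function names; after $n$ steps only the states $-n, \ldots, n$ carry probability, so a dictionary suffices.

```python
def walk_distribution(p, n):
    """Exact distribution of x_n for the unbounded walk started at x_0 = 0."""
    q = 1.0 - 2.0 * p
    dist = {0: 1.0}
    for _ in range(n):
        new = {}
        for x, prob in dist.items():
            # Move down, stay, or move up with probabilities p, q, p.
            for move, w in ((-1, p), (0, q), (1, p)):
                new[x + move] = new.get(x + move, 0.0) + prob * w
        dist = new
    return dist


def mean_and_variance(dist):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(x * pr for x, pr in dist.items())
    var = sum((x - mean) ** 2 * pr for x, pr in dist.items())
    return mean, var


if __name__ == "__main__":
    for n in (10, 40, 160):
        # The variance should match 2 * p * n, up to floating-point error.
        print(n, mean_and_variance(walk_distribution(0.25, n)))
```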

Looking at Eq. (14.47) more carefully reveals that, on average, after $n$ time instants, $x_n$ would be within $\pm\sqrt{2pn}$. If $x_n$ denotes the distance of a point from the origin from where it starts and moves backward or forward, then the distance it travels is proportional only to the square root of the time it has spent traveling. Although this result has been derived for the infinite state space case, it can still shed light on the slow convergence to the invariant distribution that we saw in the previous example, for the case of finite state space. As stated in [30], convergence to the invariant distribution can be achieved once all points in the state space have been visited, and this has a square root dependence on time. In order to get a good enough approximation of the limiting distribution, one must be patient enough to compute $O(N^2)$ iterations.


14.8 THE METROPOLIS METHOD

The Metropolis method or algorithm, as it is sometimes called, builds upon a surprisingly simple idea, and it is the first method that exploited the Markov chain theory for sampling. It appeared in the classical paper [27] and it may be the most popular and widely known sampling technique, which has inspired a wealth of variants. In contrast to rejection and importance sampling, the proposal distribution is now time-varying, following the evolution of a Markov chain; the latter is constructed such that its transition probability matrix (density kernel) has the desired distribution $p(x)$ as its invariant. Moreover, in contrast to the rejection and importance sampling techniques, it is not required that the proposal distribution “looks like” the desired one in order for the method to be useful in practice. The proposal distribution depends on the value of the previous state, that is, $q(\cdot|x_{n-1})$. In words, drawing a new sample (generating a new state) depends on the value of the previous one. In its original version, the proposal distribution was chosen to be symmetric, that is,

$$q(x|y) = q(y|x).$$

Later on, it was generalized by Hastings [14] to include nonsymmetric ones. The general scheme is known as the Metropolis-Hastings algorithm, which is summarized next.


Algorithm 14.3 (Metropolis-Hastings algorithm).

• Let the desired distribution be $p(\cdot) = \frac{1}{Z}\phi(\cdot)$.

• Choose the proposal distribution to be $q(\cdot|\cdot)$.

• Choose the value of the initial state $x_0$.

• For $n = 1, 2, \ldots, N$, Do

  - Draw $x \sim q(\cdot|x_{n-1})$

  - Compute the acceptance ratio
    * $\alpha(x|x_{n-1}) = \min\left\{1, \frac{q(x_{n-1}|x)\,\phi(x)}{q(x|x_{n-1})\,\phi(x_{n-1})}\right\}$

  - Draw
    * $u \sim \mathcal{U}(0, 1)$

  - If $u \le \alpha(x|x_{n-1})$
    * $x_n = x$

  - Else
    * $x_n = x_{n-1}$

• End For


The following points are readily deduced from the algorithm:

• The algorithm does not need the exact form of $p$. It suffices to know it up to its normalizing constant $Z$. This is due to the fact that $p$ enters into the algorithm only in the ratio for computing the acceptance ratio.

• If the proposal distribution is symmetric, the acceptance ratio becomes

$$\alpha(x|x_{n-1}) = \min\left\{1, \frac{\phi(x)}{\phi(x_{n-1})}\right\}, \tag{14.48}$$

and in this case, we sometimes refer to it as the Metropolis algorithm.

• Note that if a sample is not accepted, we retain the value of the previous state.

• Observe that a sample is accepted or rejected depending on the value of $\alpha(x|x_{n-1})$. This is easier understood by looking at the original form of the algorithm based on Eq. (14.48). If the probability $p(x)$ is larger than $p(x_{n-1})$, then the new sample is accepted. If not, it is accepted/rejected based on its relative value.

• Successive samples are not independent.

There are variants of the previous basic scheme, concerning the choice of the function for the
acceptance ratio. In [33], an argument in support of the rationale behind the Metropolis-Hastings scheme is
based on an optimality proof concerning the variance of the obtained approximations.

Let us now turn our focus to understanding how the previously stated algorithm relates to the
Markov chain theory. We will work with the more general continuous state-space models, and we
define

$p(x|y) = q(x|y)\alpha(x|y) + \delta(x - y)r(x)$, (14.49)

where $r(x)$ is the rejection probability, that is, the probability that a candidate proposed from
state $x$ is rejected,

$r(x) = \int (1 - \alpha(y|x))\,q(y|x)\,dy$, (14.50)

and $\delta(\cdot)$ is Dirac’s delta function. A little thought reveals that $p(\cdot|\cdot)$, as defined above, is the
transition density kernel (transition matrix for finite discrete spaces), $p(x_n|x_{n-1})$, for an equivalent
Markov chain. Moreover, this Markov chain has the desired distribution $p(x)$ as its invariant, that is,

$p(x) = \int p(x|y)p(y)\,dy,$

which, as already pointed out in Section 14.7, is a direct outcome of the fact that the following detailed
balance condition is satisfied (Problem 14.12):

$p(x|y)p(y) = p(y|x)p(x).$

It turns out that the equivalent Markov chain is ergodic, and hence converges to the invariant (desired)
distribution, provided that $p(x|y)$ and $p(x)$ are strictly positive. This guarantees that any state has a
nonzero probability of being reached starting from any other state.
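For a finite state space, the claims above can be checked numerically. The following sketch is an illustration under assumptions that are not part of the text: a four-state target with made-up unnormalized values `phi`, and a uniform (hence symmetric) proposal. It builds the transition matrix, the discrete counterpart of Eq. (14.49), and measures how far detailed balance and invariance are from holding.

```python
phi = [1.0, 4.0, 2.0, 3.0]           # unnormalized target (illustrative values)
Z = sum(phi)
p = [v / Z for v in phi]             # desired distribution
K = len(phi)
q = 1.0 / K                          # symmetric proposal: q(x|y) = 1/K

# T[x][y] plays the role of p(x|y): probability of moving from state y to state x.
T = [[0.0] * K for _ in range(K)]
for y in range(K):
    for x in range(K):
        if x != y:
            alpha = min(1.0, phi[x] / phi[y])   # Metropolis acceptance ratio
            T[x][y] = q * alpha
    # Staying put collects the self-proposal and all rejected proposals.
    T[y][y] = 1.0 - sum(T[x][y] for x in range(K) if x != y)

# Detailed balance: p(x|y) p(y) = p(y|x) p(x) for every pair of states.
max_db_gap = max(abs(T[x][y] * p[y] - T[y][x] * p[x])
                 for x in range(K) for y in range(K))

# Invariance: p(x) = sum_y p(x|y) p(y).
p_next = [sum(T[x][y] * p[y] for y in range(K)) for x in range(K)]
max_inv_gap = max(abs(a - b) for a, b in zip(p_next, p))

print("largest detailed-balance gap:", max_db_gap)
print("largest invariance gap:", max_inv_gap)
```

Both gaps should vanish up to rounding error, since $q\,\min\{\phi(x), \phi(y)\}/Z$ is symmetric in $x$ and $y$.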

Hence, the Metropolis-Hastings algorithm equivalently draws samples from the Markov chain
defined by the transition density given in Eq. (14.49), although the samples are drawn from the chosen
(easily sampled) proposal distribution. Typical distributions used to play the role of the proposal
distribution are the Gaussian and Cauchy distributions. The latter, due to its heavy-tail property, allows large
changes to occur from time to time. Sometimes, the uniform distribution is also used. For the discrete
case, the uniform distribution seems to be a popular choice.
Burn-in phase: After convergence, the process becomes equivalent to drawing samples from the
desired $p(x)$! However, nothing is perfect in this world; a major weakness of the Markov chain Monte
Carlo techniques is that it is difficult to assess whether the Markov chain has converged, and hence to be
sure that the samples one generates are indeed effectively independent and truly representative of $p(x)$.
Samples generated before the chain has converged are not representative of the desired distribution and
have to be discarded; this is known as the burn-in phase. The time that a Markov chain takes to
converge is known as the mixing time (e.g., [21]).

To this end, a number of diagnostics have been proposed, though none of them can be considered
a panacea (see, e.g., [6,8,35] for a discussion). A theoretical justification concerning the difficulty of
assessing convergence of such techniques is provided in [2], where it is shown that this is a
computationally intractable task.

In practice, after discarding the samples of the burn-in phase, one runs a long chain and
retains only one out of every, say, M samples. For large enough values of M, one expects to obtain
approximately independent samples. This process is also known as thinning. An alternative path is to run
a few, say, three to four, different chains of medium size (e.g., 100,000), each starting from a different
initial point, and take samples from each of them, having discarded the samples in the respective
burn-in phases (e.g., the first half of them).
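Burn-in removal and thinning amount to simple slicing of the stored chain. The sketch below is only illustrative: the helper names `burn_and_thin` and `lag1_autocorr` are hypothetical, and a first-order autoregressive sequence is used as a stand-in for a correlated Markov chain (an assumption made purely to have correlated samples at hand). It compares the lag-one correlation before and after thinning.

```python
import random


def burn_and_thin(chain, burn_in, M):
    """Drop the first burn_in samples, then keep one sample out of every M."""
    return chain[burn_in::M]


def lag1_autocorr(xs):
    """Sample correlation between successive elements of xs."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


# Stand-in for an MCMC chain: x_n = 0.9 x_{n-1} + noise, started away from equilibrium.
rng = random.Random(1)
chain, x = [], 5.0
for _ in range(20000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    chain.append(x)

kept = burn_and_thin(chain, burn_in=1000, M=20)
raw_rho = lag1_autocorr(chain[1000:])
thin_rho = lag1_autocorr(kept)
print(f"{len(kept)} samples kept; lag-1 correlation {raw_rho:.2f} -> {thin_rho:.2f}")
```

The price of thinning is visible in the sample count: most of the generated samples are thrown away in exchange for weaker dependence among those that remain.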


14.8.1 CONVERGENCE ISSUES 


When dealing with the rejection and importance sampling, we discussed that these methods do not 
scale well with dimensionality. In contrast, the Metropolis approach shows much better behavior, and 
it is an algorithm that lends itself to applications in large spaces. Having said that, the method is not 
without shortcomings. To elaborate, we will use our experience gained from the random walk examples 
and employ similar arguments as those given in [30]. 

Consider a two-dimensional task and adopt as the proposal distribution, $q(x|x_{n-1})$, the Gaussian
one with covariance matrix $\sigma^2 I$, each time centered at $x_{n-1}$. The desired distribution, from which
samples are to be drawn, is another, elongated, Gaussian $\mathcal{N}(x|0, \Sigma)$, as shown in Fig. 14.10A. The
values $\sigma_{\max}$, $\sigma_{\min}$ denote the scales (standard deviations) associated with the two axes of the ellipse
(recall from Chapter 2 that this is defined by the eigenstructure of $\Sigma$), which corresponds to the one
standard deviation contour (the exponent in the Gaussian is equal to $-1/2$) of $p(x)$.

Every time a sample is drawn from $\mathcal{N}(x|x_{n-1}, \sigma^2 I)$, the new sample will be within the circle
of radius $\sigma$ around $x_{n-1}$, with high probability. In order for the new sample to have a large chance
to lie within the high-probability elliptical region, $\sigma$ must be of the order of $\sigma_{\min}$ or smaller. If $\sigma$ is
chosen to have a large value, there is a high probability for the sample to end up outside the ellipse
and be rejected. Hence, once sampling starts inside the ellipse, small values of $\sigma$ guarantee, with high
probability, that samples remain within the ellipse, and hence are accepted. On the other hand, if $\sigma$ is
small, a large number of iterations will be required to sufficiently cover the interior of the ellipse with
points. If one looks at the process of sampling as a random walk, with approximate step size $\sigma$, then
the number of iterations needed to cover a scale of the order of $\sigma_{\max}$ will be $(\sigma_{\max}/\sigma)^2$; if
$\sigma \approx \sigma_{\min}$, this becomes $(\sigma_{\max}/\sigma_{\min})^2$. In high dimensions, where there is high probability for
one of the dimensions to be of relatively small scale compared to the maximum one, this square
dependence rule of thumb can slow convergence substantially.









FIGURE 14.10

(A) The region of significant probability mass is enclosed by the ellipse, with scales $\sigma_{\max}$ and $\sigma_{\min}$, respectively,
as defined by the major and minor axes. The region of significant probability mass of the proposal distribution is
spherical of scale equal to $\sigma$, which is of the same order as $\sigma_{\min}$. (B) Starting from the point shown as a circle,
50 generated points are shown using a proposal distribution of covariance equal to $0.1I$ ($\sigma^2 = 0.1$); rejected points are
shown in red. (C) The snapshot with 100 points and (D) with 3000 points. Observe that even in the latter case,
there are still parts in the high-probability region of the desired distribution that have not been covered.


Figs. 14.10B-D show the case where the desired two-dimensional Gaussian has zero mean and
covariance matrix given by

$\Sigma = \begin{bmatrix} 1.00 & 0.99 \\ 0.99 & 1.00 \end{bmatrix}.$

The proposal distribution is a Gaussian with covariance matrix $0.1I$. The figures show three snapshots
in the sequence of point generation, corresponding to 50, 100, and 3000 points. The rejected ones are
denoted in red.
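To get a feeling for the numbers in this example, the scales of the target can be read off the eigenvalues of its covariance matrix and plugged into the random-walk rule of thumb (squared ratio of scales). The short sketch below does this arithmetic; the helper name `sym2x2_eigenvalues` is hypothetical, and the resulting step counts are rough order-of-magnitude estimates rather than exact convergence times.

```python
import math


def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean + radius, mean - radius


lam_max, lam_min = sym2x2_eigenvalues(1.00, 0.99, 1.00)
sigma_max = math.sqrt(lam_max)     # scale along the major axis
sigma_min = math.sqrt(lam_min)     # scale along the minor axis
sigma_prop = math.sqrt(0.1)        # proposal covariance 0.1 I

# Random-walk rule of thumb: (sigma_max / step)^2 iterations to traverse the long axis.
steps_prop = (sigma_max / sigma_prop) ** 2
steps_min = (sigma_max / sigma_min) ** 2

print(f"sigma_max = {sigma_max:.3f}, sigma_min = {sigma_min:.3f}")
print(f"steps with the proposal scale: about {steps_prop:.0f}")
print(f"steps with a step of size sigma_min: about {steps_min:.0f}")
```

Note that the proposal scale used in the figures is larger than $\sigma_{\min}$, which is consistent with the many rejected points, while a step as small as $\sigma_{\min}$ would be accepted more often at the cost of roughly ten times more iterations.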
FIGURE 14.11

The desired distribution comprises the mixture of two Gaussians. In each figure, a different initialization point is
selected. In all cases, 400 points have been generated. Note that in two of the cases, the process seems to be trapped
in either one of the two Gaussians.


Another problem that may arise with the Metropolis method is that of local trapping. This may
occur when the desired distribution is multimodal, which is common in high-dimensional complex
problems. We will demonstrate the case via a simple example in the two-dimensional space. Let the
desired distribution comprise a mixture of two Gaussians,

$p(x) = \frac{1}{2}\mathcal{N}(x|\mu_1, \Sigma_1) + \frac{1}{2}\mathcal{N}(x|\mu_2, \Sigma_2),$

where $\mu_1 = [0, 0]^T$, $\mu_2 = [5, 5]^T$, $\Sigma_1 = \Sigma_2 = \mathrm{diag}\{0.25, 2\}$, and the proposal distribution is
$\mathcal{N}(x|\mu, I)$, where $\mu = [2.5, 2.5]^T$. Fig. 14.11 shows the paths traveled by the drawn and accepted points
over three different runs. In Figs. 14.11A and B, after 400 iterations the points drawn cover only one of the
two mixture components. Both components are visited in the run corresponding to Fig. 14.11C.
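Local trapping can be explored with a short experiment on the same mixture. The sketch below is not a reproduction of Fig. 14.11: it assumes a random-walk Gaussian proposal centered at the current state with a hypothetical step size of 0.5, and the function names are illustrative. It reports, for a few starting points, the fraction of the 400 generated points that lie closer to the first mode than to the second.

```python
import math
import random

MU = [(0.0, 0.0), (5.0, 5.0)]     # the two mean vectors
VAR = (0.25, 2.0)                 # diagonal of Sigma_1 = Sigma_2


def log_gauss(x, mu):
    """Log density of a 2-D Gaussian with mean mu and covariance diag(VAR)."""
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, VAR))
    log_norm = -math.log(2.0 * math.pi) - 0.5 * math.log(VAR[0] * VAR[1])
    return log_norm - 0.5 * quad


def log_p(x):
    """Log of the equal-weight mixture, computed with the log-sum-exp trick."""
    a, b = (math.log(0.5) + log_gauss(x, mu) for mu in MU)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def run_chain(x0, n_iter, step, rng):
    """Random-walk Metropolis; the step size is an assumption of this sketch."""
    x, path = tuple(x0), []
    for _ in range(n_iter):
        cand = (rng.gauss(x[0], step), rng.gauss(x[1], step))
        if math.log(1.0 - rng.random()) <= log_p(cand) - log_p(x):
            x = cand
        path.append(x)
    return path


def fraction_near_first_mode(path):
    near = sum(1 for x in path if math.dist(x, MU[0]) < math.dist(x, MU[1]))
    return near / len(path)


rng = random.Random(0)
for start in [(0.0, 0.0), (5.0, 5.0), (2.5, 2.5)]:
    path = run_chain(start, n_iter=400, step=0.5, rng=rng)
    print(start, "-> fraction near first mode:", fraction_near_first_mode(path))
```

A fraction close to zero or one indicates that the chain stayed in a single component during the run; the density in the valley between the two means is many orders of magnitude lower than at the modes, which is what makes crossings rare for small steps.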


14.9 GIBBS SAMPLING 

Gibbs sampling is among the most popular and widely used sampling methods. It is also known as the 
heat bath algorithm. Although Gibbs sampling was already known and used in statistical physics, two 
papers [9,10] were catalytic for its widespread use in the Bayesian and machine learning communities. 
Josiah Willard Gibbs (1839-1903) was an American scientist whose work on thermodynamics laid
the foundations of physical chemistry. Together with James Clerk Maxwell (1831-1879) and Ludwig
Eduard Boltzmann (1844-1906), Gibbs pioneered the field of statistical mechanics. He is also the
father, together with Oliver Heaviside (1850-1925), of what is today known as vector calculus.

Gibbs sampling is appropriate for drawing samples from multidimensional distributions, and it can 
be considered a special instance of the more general Metropolis method. 

Let the random vector in the Markov chain at time $n$ be given as

$x_n = [x_n(1), \ldots, x_n(l)]^T.$

The basic assumption underlying Gibbs sampling is that the conditional distributions of each one of
the variables, $x_n(d)$, $d = 1, 2, \ldots, l$, given the rest, that is,

$p(x_n(d)|\{x_n(i): i \ne d\})$, (14.51)

are known and they can be easily sampled. At each iteration, a sample is drawn for only one of the
variables, based on Eq. (14.51), by freezing the values of the rest to those already available from the
previous iteration. The scheme is summarized next.

Algorithm 14.4 (Gibbs sampler).

• Initialize, $x_0(1), \ldots, x_0(l)$, arbitrarily.

• For $n = 1, 2, \ldots, N$, Do

- For $d = 1, 2, \ldots, l$, Do

* Draw

• $x_n(d) \sim p(x|\{x_n(i): i < d\}, \{x_{n-1}(i): i > d\})$, where the first set is empty for $d = 1$ and the second for $d = l$

- End For

• End For

Note that in the previous scheme, all dimensions are visited in sequence. Another version is to visit 
them in random order. 
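As an illustration of Algorithm 14.4 with sequential visiting, consider a zero-mean two-dimensional Gaussian with unit variances and correlation coefficient $\rho$, for which each conditional is the one-dimensional Gaussian $\mathcal{N}(\rho x_{\text{other}}, 1 - \rho^2)$. The target and the value $\rho = 0.5$ are assumptions made for this sketch only, and the function name is hypothetical.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, n_iter, rng, x0=(0.0, 0.0)):
    """Gibbs sampler for N(0, [[1, rho], [rho, 1]]), visiting d = 1, 2 in sequence."""
    cond_std = math.sqrt(1.0 - rho * rho)      # std of each conditional
    x1, x2 = x0
    samples = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, cond_std)     # d = 1: conditions on x_{n-1}(2)
        x2 = rng.gauss(rho * x1, cond_std)     # d = 2: conditions on the fresh x_n(1)
        samples.append((x1, x2))
    return samples


rng = random.Random(0)
samples = gibbs_bivariate_gaussian(rho=0.5, n_iter=20000, rng=rng)
n = len(samples)
m1 = sum(s[0] for s in samples) / n
m2 = sum(s[1] for s in samples) / n
cov12 = sum((s[0] - m1) * (s[1] - m2) for s in samples) / n
print(f"means ({m1:.3f}, {m2:.3f}), sample covariance {cov12:.3f}")
```

Every draw is kept; there is no accept/reject step. For values of $\rho$ close to one the conditional standard deviation becomes small and successive samples move very little, which is the slow random walk-like behavior of the method for highly correlated variables.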

The Gibbs scheme can be viewed as a realization of a Markov chain, where the transition
matrix/PDF is sequentially constructed from $l$ base transitions, that is,

$T = B_1 \cdots B_l,$

where each individual base transition acts on the corresponding dimension, that is, coordinate-wise.
For continuous variables, it can be readily checked that

$B_d(x|y) = p(x(d)|\{y(i): i \ne d\})\prod_{i \ne d}\delta(y(i) - x(i)), \quad d = 1, 2, \ldots, l.$

In words, only the component $x(d)$ changes while the rest are left unchanged. It is not difficult to see
that the desired joint distribution $p(x) = p(x(1), \ldots, x(l))$ is invariant with respect to each one of $B_d$,
$d = 1, 2, \ldots, l$ (Problem 14.13). Hence, it will be invariant under their product

$T = B_1 \cdots B_l.$

Ergodicity is ensured by requiring that all conditional probabilities are strictly positive, which
guarantees the convergence of the chain to the desired $p(x)$.

Remarks 14.3. 

• Gibbs sampling, being an instance of the Metropolis method, inherits its random walk-like
convergence performance.

• Gibbs sampling is appropriate for many graphical models (Chapter 16) that are described in terms of 
conditional distributions. Often, these distributions can be sampled in an easy way, using techniques 
such as rejection sampling and its variants, as discussed in Section 14.4. 

• Note that in Gibbs sampling, no samples are rejected. This can also be shown if Gibbs sampling is 
considered as an instance of the Metropolis method, via the specific choice of Eq. (14.51) as the 
proposal distribution (Problem 14.14). 

• Blocking Gibbs sampling: Gibbs sampling samples one variable at a time. This can make the
algorithm move very slowly through the state space in case the variables are highly correlated. In such
cases, it is preferable to form groups (blocks) of variables, not necessarily disjoint, and sample from the
variables in the block, conditioned on the remaining. This is known as blocking Gibbs sampling
[16], and it improves performance by achieving much bigger moves through the state space.

• Collapsed Gibbs sampling: In collapsed Gibbs sampling, one integrates out (marginalizes over) one
or more variables and samples from the remaining ones. For example, in the case of three variables,
Gibbs sampling samples from $p(x_1|x_2, x_3)$, then $p(x_2|x_1, x_3)$, and finally $p(x_3|x_1, x_2)$ to complete
the iteration step. In collapsed Gibbs sampling, we can integrate out, for example, $x_3$ (which is
collapsed), and sample sequentially from $p(x_1|x_2)$ and $p(x_2|x_1)$. Then sampling is performed in a
lower-dimensional space, and hence it is more efficient. Collapsing one variable is tractable if it is a
conjugate prior of another involved variable; for example, they are both members of the exponential
family. Thus, $x_3$ does not participate in Gibbs sampling. In the sequel, we can sample $p(x_3|x_1, x_2)$.
This can be justified by the Rao-Blackwell theorem, which states that the variance of the estimate
created by analytically integrating out $x_3$ will always be lower than (or equal to) the variance of
direct Gibbs sampling [23].


14.10 IN SEARCH OF MORE EFFICIENT METHODS: A DISCUSSION 

In order to sidestep the drawbacks associated with the described basic Markov chain-based schemes,
namely, the slow random walk-like convergence and the local trap problem, a number of more advanced
methods have been suggested. It is beyond the scope of this chapter to present such schemes in more
detail and the interested reader may consult more specialized books and articles, e.g., [5,22,24,30,38].
Below, we provide a short discussion on some of the most popular directions that have been proposed.

A family of algorithms known as auxiliary variable Markov chain Monte Carlo methods is a
popular one. Such methods augment with auxiliary variables either the desired or the proposal distribution
in the Metropolis-Hastings algorithm. The presence of the auxiliary variable is intended to either help
the algorithm to escape from possible local traps, or to cancel out the normalizing constant, if this is
intractable. Such methods include algorithms like simulated annealing [17], simulated
tempering [25], and the slice sampler [15]. The rationale behind the slice sampling techniques builds around
our discussion in Section 14.4; recall that sampling from $p(x)$ is equivalent to sampling uniformly
from the region

$A = \{(x, u): x \in \mathbb{R}^l,\ 0 \le u \le p(x)\}.$

In [31], a Gibbs-type implementation of the slice sampler is suggested; each component of $x$ is
updated sequentially using a single-variable slice sampling strategy. It turns out that the slice sampler
improves upon the convergence speed of the standard Metropolis-Hastings algorithm.

In [29], an auxiliary variable is used so that the computation of the normalizing constant is bypassed. 
This is important in cases where its computation is intractable. 

Another sampling philosophy spans the population-based methods. In order to overcome the local
trap problem, a number (population) of Markov chains are run in parallel under an information
exchange strategy, which improves convergence. Typical examples of such techniques include adaptive
direction sampling [12] and the evolutionary Monte Carlo method, which builds upon arguments used
in genetic algorithms [22].

Another direction is that of the Hamiltonian Monte Carlo methods, which exploit arguments from
classical mechanics around the elegant Hamiltonian equations [26,30]. For PDFs of the form

$p(x) = \frac{1}{Z_E}\exp(-E(x)),$

$E(x)$ can be given the interpretation of the system’s potential energy (Section 15.4.2). Once such a
bridge has been established, an auxiliary random vector, $q$, is introduced and interpreted as the
momentum of the system; hence, the corresponding kinetic energy is expressed as

$K(q) = \frac{1}{2}\sum_{i=1}^{l} q_i^2.$

The Hamiltonian function is then given by

$H(x, q) = E(x) + K(q),$

and it defines the distribution

$p(x, q) = \frac{1}{Z_H}\exp(-H(x, q)) = \frac{1}{Z_E}\exp(-E(x))\,\frac{1}{Z_K}\exp(-K(q)) := p(x)p(q),$

where $Z_K$ is the normalizing constant of the respective Gaussian term associated with the kinetic
energy. The desired distribution, $p(x)$, is obtained as the marginal of $p(x, q)$. Hence, if sampling
from $p(x, q)$ is possible, then discarding $q$ results in samples drawn from the desired distribution. The
evolution of the variables in time is given by the associated Hamiltonian dynamics of the equivalent
system.
Such methods may lead to a substantial improvement in convergence speed; the reason is that via
the Hamiltonian interpretation, information hidden in the derivatives of $E(x)$ (i.e., $\frac{dq}{dt} = -\frac{\partial E(x)}{\partial x}$) is
exploited in order for the system to detect directions toward high-probability mass.
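The text does not specify how the Hamiltonian dynamics are simulated. A common choice, assumed here and not taken from the book, is the leapfrog integrator followed by a Metropolis correction that compensates for the discretization error. The sketch below works in one dimension with a hypothetical quadratic energy $E(x) = x^2/2$ and unit-mass kinetic energy $K(q) = q^2/2$; the function names, step size, and number of steps are all illustrative.

```python
import math
import random


def leapfrog(x, q, grad_E, eps, n_steps):
    """Approximate dx/dt = q, dq/dt = -dE/dx with the leapfrog scheme."""
    q = q - 0.5 * eps * grad_E(x)              # initial half step for the momentum
    for step in range(n_steps):
        x = x + eps * q                        # full step for the position
        if step < n_steps - 1:
            q = q - eps * grad_E(x)            # full step for the momentum
    q = q - 0.5 * eps * grad_E(x)              # final half step for the momentum
    return x, q


def hmc_step(x, E, grad_E, eps, n_steps, rng):
    """One Hamiltonian Monte Carlo transition for a scalar state x."""
    q0 = rng.gauss(0.0, 1.0)                   # momentum drawn from exp(-K(q)) / Z_K
    x_new, q_new = leapfrog(x, q0, grad_E, eps, n_steps)
    h_old = E(x) + 0.5 * q0 * q0
    h_new = E(x_new) + 0.5 * q_new * q_new
    # Metropolis correction: accept with probability min{1, exp(h_old - h_new)}.
    if math.log(1.0 - rng.random()) <= h_old - h_new:
        return x_new
    return x


E = lambda x: 0.5 * x * x                      # hypothetical energy (Gaussian target)
grad_E = lambda x: x

# The integrator nearly conserves the Hamiltonian along a trajectory.
x_end, q_end = leapfrog(1.0, 0.0, grad_E, eps=0.05, n_steps=20)
energy_drift = abs((E(x_end) + 0.5 * q_end ** 2) - E(1.0))

rng = random.Random(0)
x, xs = 0.0, []
for _ in range(5000):
    x = hmc_step(x, E, grad_E, eps=0.05, n_steps=20, rng=rng)
    xs.append(x)                               # the momentum is simply discarded
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
print(f"energy drift {energy_drift:.2e}, mean {mean:.3f}, variance {var:.3f}")
```

The gradient of the energy steers each trajectory, so a single transition can move the state much further than a random-walk proposal of comparable acceptance rate.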

In the reversible jump Markov chain Monte Carlo algorithms, the Metropolis-Hastings algorithm
is extended to account for state spaces of varying dimensionality [13]. Such methods are appropriate
in cases where multiple parameter models of varying dimensionality are involved. Thus, the Markov
chain is given the liberty to jump between models of different dimensionality.

VARIATIONAL INFERENCE OR MONTE CARLO METHODS 

In the beginning of this chapter we mentioned that the variational inference techniques, which were 
considered in Chapter 13, are the deterministic alternatives of the Monte Carlo methods. We will now 
attempt to sketch in a few lines the pros and cons of each of the two approaches. The main advantages 
concerning the former path to Bayesian learning are as follows: 

• They are computationally more efficient for small- and medium-scale tasks. 

• It is fairly easy to determine when to stop iterations and to know when convergence has been 
achieved. 

• One can compute lower bounds for the likelihood function. 

The advantages concerning Monte Carlo methods are as follows: 

• They can be applied to more general cases, for example, models without computationally convenient 
priors, or to models whose structure is changing. 

• They do not rely on approximations such as the mean field approximation. 

• They can be more efficient for large-scale tasks. 


14.11 A CASE STUDY: CHANGE-POINT DETECTION 

The task of change-point detection is of major importance in a number of scientific disciplines, ranging
from engineering and sociology to economics and environmental studies. The accumulated literature is
vast; see, for example, [1,19,36] and the references therein. The aim of the change-point identification
task is to detect partitions in a sequence of observations, in order for the data in each block to be
statistically “similar,” in other words, to be distributed according to a common probability distribution.
The hidden Markov models and the dynamical Bayesian methods, which are discussed in Chapter 17,
come under this more general umbrella of problems. Our goal in this example is to demonstrate the use
of Gibbs sampling in the context of the change-point detection task (see, e.g., [4]).

Let $x_n$ be a discrete random variable that corresponds to the count of an event, for example, the
number of requests for telephone calls within an interval of time, requests for individual documents on
a web server, particle emissions in radioactive materials, or number of accidents in a working
environment. We adopt the Poisson process to model the distribution of $x_n$, that is,

$P(x|\lambda) = \frac{(\lambda\tau)^x}{x!}e^{-\lambda\tau}.$ (14.52)

Poisson processes have been widely used to model the number of events that take place in a time
interval, $\tau$. For our example, we have chosen $\tau = 1$. The parameter $\lambda$ is known as the intensity of the
process (see, e.g., [32]).

We assume that our observations, $x_n$, $n = 1, 2, \ldots, N$, have been generated by two different
Poisson processes, $P(x|\lambda_1)$ and $P(x|\lambda_2)$. Also, the change of the model has taken place suddenly at an
unknown time instant, $n_0$. Our goal is to estimate the posterior,

$P(n_0|\lambda_1, \lambda_2, x_{1:N}).$

Moreover, the exact values of $\lambda_1$ and $\lambda_2$ are not known. The only available information is that the
Poisson process intensities, $\lambda_i$, $i = 1, 2$, are distributed according to a (prior) gamma distribution,
that is,

$p(\lambda) = \text{Gamma}(\lambda|a, b) = \frac{1}{\Gamma(a)}b^a\lambda^{a-1}\exp(-b\lambda),$

for some known positive values $a$, $b$. We will finally assume that we have no prior information on when
the change occurred; thus, the prior is chosen to be the uniform distribution $P(n_0) = \frac{1}{N}$. Based
on the previous assumptions, the corresponding joint distribution is given by

$p(n_0, \lambda_1, \lambda_2, x_{1:N}) = p(x_{1:N}|\lambda_1, \lambda_2, n_0)p(\lambda_1)p(\lambda_2)P(n_0),$

or

$p(n_0, \lambda_1, \lambda_2, x_{1:N}) = \prod_{n=1}^{n_0} P(x_n|\lambda_1)\prod_{n=n_0+1}^{N} P(x_n|\lambda_2)\,p(\lambda_1)p(\lambda_2)P(n_0).$

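Taking the logarithm in order to get rid of the products, and integrating out respective variables, the
following conditionals needed in Gibbs sampling are obtained (Problem 14.15):

$p(\lambda_1|n_0, \lambda_2, x_{1:N}) = \text{Gamma}(\lambda_1|a_1, b_1),$ (14.53)

with

$a_1 = a + \sum_{n=1}^{n_0} x_n, \quad b_1 = b + n_0,$

$p(\lambda_2|n_0, \lambda_1, x_{1:N}) = \text{Gamma}(\lambda_2|a_2, b_2),$ (14.54)

$a_2 = a + \sum_{n=n_0+1}^{N} x_n, \quad b_2 = b + (N - n_0),$

and, up to an additive constant that does not depend on $n_0$,

$\ln P(n_0|\lambda_1, \lambda_2, x_{1:N}) = \ln\lambda_1\sum_{n=1}^{n_0} x_n - n_0\lambda_1 + \ln\lambda_2\sum_{n=n_0+1}^{N} x_n - (N - n_0)\lambda_2, \quad n_0 = 1, 2, \ldots, N.$ (14.55)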
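As a hint on how the first conditional arises, one can collect the factors of the joint distribution that depend on $\lambda_1$, using the Poisson model with $\tau = 1$ and the gamma prior. The following is a sketch of that single step only, not a full solution of Problem 14.15; the factors $1/x_n!$ and the normalizing constant of the prior are dropped because they do not depend on $\lambda_1$.

```latex
\begin{align*}
p(\lambda_1 \mid n_0, \lambda_2, x_{1:N})
  &\propto p(\lambda_1)\prod_{n=1}^{n_0} P(x_n \mid \lambda_1) \\
  &\propto \lambda_1^{a-1} e^{-b\lambda_1}\;
           \lambda_1^{\sum_{n=1}^{n_0} x_n}\, e^{-n_0\lambda_1} \\
  &= \lambda_1^{\left(a + \sum_{n=1}^{n_0} x_n\right) - 1}\,
     e^{-(b + n_0)\lambda_1}.
\end{align*}
```

The last line has the functional form of a gamma density with parameters $a + \sum_{n=1}^{n_0} x_n$ and $b + n_0$, in agreement with Eq. (14.53); the same argument over the samples $n_0 + 1, \ldots, N$ gives the conditional for $\lambda_2$.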
FIGURE 14.12

Number of deadly accidents per year in the coal mines in England over the period 1851-1962.


Note that the first two conditionals are gamma distributed and, as we said at the end of Section 14.4,
a number of different approaches are available for generating samples from such distributions. The last
distribution is a discrete one, and samples can be drawn as discussed in Algorithm 14.1. We are now ready to
apply Gibbs sampling.

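Algorithm 14.5 (Gibbs sampling for change-point detection).

• Having obtained $x_{1:N} := \{x_1, \ldots, x_N\}$, select $a$ and $b$.

• Initialize $n_0^{(0)}$.

• For $i = 1, 2, \ldots$, Do

- $\lambda_1^{(i)} \sim \text{Gamma}\left(\lambda|a + \sum_{n=1}^{n_0^{(i-1)}} x_n,\ b + n_0^{(i-1)}\right)$

- $\lambda_2^{(i)} \sim \text{Gamma}\left(\lambda|a + \sum_{n=n_0^{(i-1)}+1}^{N} x_n,\ b + (N - n_0^{(i-1)})\right)$

- $n_0^{(i)} \sim P(n_0|\lambda_1^{(i)}, \lambda_2^{(i)}, x_{1:N})$

• End For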
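A compact Python sketch of Algorithm 14.5 follows. It is run on a small synthetic sequence with an obvious change after $n = 40$, not on the coal-mining data, which are not reproduced here. The function names are hypothetical, the initialization at $N/2$ is an arbitrary choice, and the discrete conditional of Eq. (14.55) is exponentiated after subtracting its maximum to avoid underflow.

```python
import math
import random
from collections import Counter


def sample_n0(x, lam1, lam2, rng):
    """Draw n0 in {1, ..., N} from the discrete conditional of Eq. (14.55)."""
    N = len(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)           # prefix[k] = x_1 + ... + x_k
    total = prefix[N]
    log_w = [math.log(lam1) * prefix[n0] - n0 * lam1
             + math.log(lam2) * (total - prefix[n0]) - (N - n0) * lam2
             for n0 in range(1, N + 1)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]        # unnormalized probabilities
    return rng.choices(range(1, N + 1), weights=w)[0]


def gibbs_change_point(x, a, b, n_iter, rng):
    N = len(x)
    n0 = N // 2                                 # arbitrary initialization
    draws = []
    for _ in range(n_iter):
        s1, s2 = sum(x[:n0]), sum(x[n0:])
        # random.gammavariate takes (shape, scale); the scale is 1 / rate.
        lam1 = rng.gammavariate(a + s1, 1.0 / (b + n0))
        lam2 = rng.gammavariate(a + s2, 1.0 / (b + N - n0))
        n0 = sample_n0(x, lam1, lam2, rng)
        draws.append((n0, lam1, lam2))
    return draws


# Synthetic counts (NOT the coal-mining data): intensity drops after n = 40.
x = [3] * 40 + [1] * 60
rng = random.Random(0)
draws = gibbs_change_point(x, a=2.0, b=1.0, n_iter=1200, rng=rng)
kept = draws[200:]                              # discard a burn-in of 200 samples
mode_n0 = Counter(d[0] for d in kept).most_common(1)[0][0]
mean_lam1 = sum(d[1] for d in kept) / len(kept)
mean_lam2 = sum(d[2] for d in kept) / len(kept)
print(f"most frequent n0: {mode_n0}, mean intensities: {mean_lam1:.2f}, {mean_lam2:.2f}")
```

The histogram of the retained $n_0$ values approximates the posterior of the change point, and the retained pairs of intensities can be plotted as a scatter diagram.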

Fig. 14.12 shows the number of deadly accidents per year in the coal mines in England spanning
the years 1851-1962. Looking at the graph, it is readily observed that the “front” part of the graph
looks different from its “back” end, with a change around 1890-1900. As a matter of fact, in 1890,
new health and safety regulations were introduced, following pressure from the coal miners’ unions.
We will use the model explained before and draw samples according to Algorithm 14.5 in order to
determine the point, $n_0$, where a change in the statistical distributions describing the data occurred [4].
The values of $a$ and $b$ were chosen equal to $a = 2$ and $b = 1$, although the obtained results are not
sensitive to their choice. The burn-in phase was 200 samples. Fig. 14.13 shows the obtained histogram
of the values of $n_0$ drawn by the algorithm, which clearly indicates a peak at the year 1890. Fig. 14.14
shows the plot of the points drawn for $\lambda_1$ and $\lambda_2$. The plot clearly indicates that the intensity of the
Poisson process dropped from $\lambda_1 = 3$ to $\lambda_2 = 1$ after the introduction of the safety regulations.
FIGURE 14.13

The histogram obtained from the values of $n_0$ generated by the algorithm, which approximates the posterior for $n_0$,
for the case study of Example 14.11. Observe that the histogram peaks at 1890, the year when the new regulations
were introduced.

FIGURE 14.14

Case study of Example 14.11: The cluster formed by the obtained values of $\lambda_1$ and $\lambda_2$.


PROBLEMS 

14.1 Show that if $F_x(x)$ is the cumulative distribution function of a random variable x, then the
random variable $u = F_x(x)$ follows the uniform distribution in [0, 1].

14.2 Show that if u follows the uniform distribution and

$x = F_x^{-1}(u) := g(u),$ (14.56)

then indeed x is distributed according to $F_x(x) = \int_{-\infty}^{x} p(x)\,dx$.
14.3 Consider the random variables $r$ and $\phi$ with exponential and uniform distributions

$p_r(r) = \frac{1}{2}\exp\left(-\frac{r}{2}\right), \quad r \ge 0,$

and

$p_\phi(\phi) = \begin{cases} \frac{1}{2\pi}, & 0 \le \phi \le 2\pi, \\ 0, & \text{otherwise}, \end{cases}$

respectively. Show that the transformation

$x = \sqrt{r}\cos\phi = g_x(r, \phi),$
$y = \sqrt{r}\sin\phi = g_y(r, \phi)$

renders both x and y to follow the normalized Gaussian $\mathcal{N}(0, 1)$.

14.4 Show that if

$p_x(x) = \mathcal{N}(x|0, I),$

then y given by the transformation

$y = Lx + \mu$

is distributed according to

$p_y(y) = \mathcal{N}(y|\mu, \Sigma),$

where $\Sigma = LL^T$.

14.5 Consider two Gaussians

$p(x) = \mathcal{N}(x|0, \sigma_p^2 I), \quad \sigma_p^2 = 0.1$

and

$q(x) = \mathcal{N}(x|0, \sigma_q^2 I), \quad \sigma_q^2 = 0.11,$

with $x \in \mathbb{R}^l$. In order to use $q(x)$ for drawing samples from $p(x)$ via the rejection sampling
method, a constant $c$ has to be computed so that

$cq(x) \ge p(x).$

Show that

$c \ge \left(\frac{\sigma_q}{\sigma_p}\right)^l,$

and compute the probability of accepting samples.

14.6 Show that using importance sampling leads to an unbiased estimator for the normalizing
constant of the desired distribution,

$p(x) = \frac{1}{Z}\phi(x).$

However, the estimator of E[f(x)] for a function f is a biased one.

14.7 Let $p(x) = \mathcal{N}(x|0, \sigma_1^2 I)$. Choose the proposal distribution for importance sampling as

$q(x) = \mathcal{N}(x|0, \sigma_2^2 I).$

The weights are computed as

$w(x) = \frac{p(x)}{q(x)}.$

If $w(0)$ is the weight at $x = 0$, then the ratio is given by

$\frac{w(x)}{w(0)} = \exp\left(\frac{\sigma_1^2 - \sigma_2^2}{2\sigma_1^2\sigma_2^2}\sum_{i=1}^{l} x_i^2\right).$

Observe that even for a very good match between $q(x)$ and $p(x)$ ($\sigma_1^2 \approx \sigma_2^2$), for large values
of $l$, the values of the weights can change significantly due to the exponential dependence.

14.8 Show that a stochastic matrix P always has the value $\lambda = 1$ as its eigenvalue.

14.9 Show that if the eigenvalue of a transition matrix is not equal to one, its magnitude cannot be
larger than one, that is, $|\lambda| \le 1$.

14.10 Prove that if P is a stochastic matrix and $\lambda \ne 1$, then the elements of the corresponding
eigenvector add to zero.

14.11 Prove the square root dependence of the distance traveled by a random walk with infinitely
many integer states on the time n.

14.12 Prove, using the detailed balance condition, that the invariant distribution associated with the 
Markov chain implied by the Metropolis-Hastings algorithm is the desired distribution, p(x). 

14.13 Show that in Gibbs sampling, the desired joint distribution is invariant with respect to each one 
of the base transition PDFs. 

14.14 Show that the acceptance rate for the Gibbs sampling is equal to one. 

14.15 Derive the formulas for the conditional distributions of Example 14.11. 


MATLAB® EXERCISE 

14.16 Develop a MATLAB® code for the Gibbs sampler. Then, use it to draw samples from the 
two-dimensional Gaussian distribution, with mean value and covariance matrix equal to 


Derive the conditional PDF of each one of the variables with respect to the other (use the 
appendix associated with Chapter 12, which can be downloaded from the book’s website). 
Then use the conditional PDFs to implement the Gibbs sampler. Plot the generated points in 
the two-dimensional space after 20, 50, 100, 300, and 1000 iterations. What do you observe 
concerning convergence? 
15.1 INTRODUCTION

In Fig. 13.2, we used a graphical description to indicate conditional dependencies among various
parameters that control the “fusion” of the prior and conditional PDFs in a hierarchical manner. Our
purpose there was more of a pedagogical nature; we could live without it. In this chapter, graphical
models emerge out of necessity. In many everyday machine learning applications involving
multivariate statistical modeling, even simple inference tasks can easily become computationally intractable.
Typical applications involve bioinformatics, speech recognition, machine vision, and text mining, to
name but a few.

Graph theory has proved a powerful and elegant tool that has extensively been used in optimization
and computational theory. A graph encodes dependencies among interacting variables and can be used
to formalize the probabilistic structure that underlies our modeling assumptions. This can then be
used to facilitate computations in a number of inference tasks, such as the calculation of marginals,
modes, and conditional probabilities. Moreover, graphical models can be used as a vehicle to impose
approximations onto the models when computational needs go beyond the available resources.

Early celebrated examples of the use of such models in learning tasks are the hidden Markov
models, Kalman filtering, and error correcting coding, which have been popular since the early 1960s.

This is the first of two chapters dedicated to probabilistic graphical models. This chapter focuses 
on the basic definitions and concepts, and most of its material is a must for a first reading on the topic. 
A number of basic graphical models are discussed, such as Bayesian networks (BNs) and Markov 
random fields (MRFs). Exact inference is presented, and the elegant message passing algorithm for 
inference on chains and trees is introduced. 


15.2 THE NEED FOR GRAPHICAL MODELS 

Let us consider a simplified example of a learning system in the context of a medical application. Such
a system comprises a set of $m$ diseases that correspond to hidden variables and a set of $n$ symptoms
(findings). The diseases are treated as random variables, $d_1, d_2, \ldots, d_m$, and each of them can be absent
or present and thus can be encoded by a zero or a one, that is, $d_j \in \{0, 1\}$, $j = 1, 2, \ldots, m$. The same
applies to the symptoms $f_i$, which can either be absent or present; hence, $f_i \in \{0, 1\}$, $i = 1, 2, \ldots, n$.
The symptoms comprise the observed variables.

The goal of the system is to predict a disease hypothesis, that is, the presence of a number of
diseases, given the presence of a set of symptoms which have been observed. During the training, which
is based on experts’ assessments, the system learns the prior probabilities $P(d_j)$ and the conditional
probabilities $P_{ij} = P(f_i = 1|d_j = 1)$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$. The latter comprise a table of
$nm$ entries. For a realistic system, these numbers can be very large. For example, in [41], $m$ is of the
order of 500-600 and $n$ of the order of 4000. Let $f$ be the vector that corresponds to a specific set of
observations for the findings, indicating the presence or absence of the respective symptoms. Assuming
that symptoms are conditionally independent, given any disease hypothesis, $d$, we can write

$P(f|d) = \prod_{i=1}^{n} P(f_i|d).$ (15.1)

Ideally, one should be able to obtain the conditional probabilities $P(f_i|d)$ for each disease hypothesis.
However, for all possible $2^m$ combinations of $d$, this would require a huge amount of training data,
which is impossible to collect for any practical system. This is bypassed by adopting the following
model:

$P(f_i = 0|d) = \prod_{j=1}^{m}(1 - P_{ij})^{d_j},$ (15.2)

where the exponent is set to zero, $d_j = 0$, when the disease is not related to the symptom. This is
known as the noisy-OR model. That is, it is assumed that for a negative finding the individual causes
are independent [37]. Obviously,

$P(f_i = 1|d) = 1 - P(f_i = 0|d).$

1 In a more realistic system, some of the findings may not be available, that is, they may be unobservable.


Let us now assume that we observe a set of findings, $f$, and we want to infer $P(d_j|f)$ for some $j$.
Then

$P(d_j = 1|f) = \frac{P(f|d_j = 1)P(d_j = 1)}{P(f)} = \frac{\sum_{d: d_j = 1} P(f|d)P(d)}{\sum_{d} P(f|d)P(d)}.$ (15.3)

The summation in the denominator involves $2^m$ terms. For $m \approx 500$, this is a formidable task that
simply cannot be carried out in a realistic time.

The previous example indicates that once one gets involved with complex systems, even innocent 
looking tasks turn out to be computationally intractable. Thus, one has either to be more elever in 
exploiting possible independencies in the data, which can reduce the required number of computations, 
or make certain assumptions/approximations. In this chapter, we will study both altematives. 
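To make the cost of Eq. (15.3) concrete, the sketch below evaluates the posterior by brute-force enumeration of all $2^m$ disease hypotheses. It is only an illustration for tiny $m$. Two things are assumptions and not taken from the text: the prior over hypotheses is taken to be the product of the individual priors $P(d_j)$, and the function name and the numerical values are hypothetical.

```python
import itertools
import numpy as np

def posterior_by_enumeration(f, P, prior, j):
    """P(d_j = 1 | f) via Eq. (15.3), summing over all 2^m hypotheses."""
    f = np.asarray(f)
    prior = np.asarray(prior, dtype=float)
    m = len(prior)
    num = den = 0.0
    for bits in itertools.product([0, 1], repeat=m):   # 2^m iterations
        d = np.array(bits)
        p_neg = np.prod((1.0 - P) ** d, axis=1)               # Eq. (15.2)
        lik = np.prod(np.where(f == 1, 1.0 - p_neg, p_neg))   # Eq. (15.1)
        # Assumed prior: diseases are independent a priori
        p_d = np.prod(np.where(d == 1, prior, 1.0 - prior))
        den += lik * p_d
        if d[j] == 1:
            num += lik * p_d
    return num / den

# Hypothetical values: 3 findings, 2 diseases
P = np.array([[0.8, 0.1],
              [0.3, 0.6],
              [0.0, 0.5]])
prior = [0.1, 0.2]
post = posterior_by_enumeration([1, 0, 0], P, prior, j=0)
print(post)
```

The loop body runs $2^m$ times, so every additional disease doubles the work; this is the growth that makes exact evaluation impractical for $m$ in the hundreds.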

Before we proceed further, it is interesting to point out another source of computational obstacles
besides the calculation of Eq. (15.3). In practice, it may be more convenient to perform additions instead
of multiplications; multiplying a large number of variables of small values, such as probabilities,
may cause arithmetic accuracy problems. One way to bypass products is via logarithmic
or exponential operations, which transform products into summations. For example, Eq. (15.2) can be
rewritten as

$$P(f_i = 0|\boldsymbol{d}) = \exp\Big(-\sum_{j=1}^{m} \theta_{ij} d_j\Big), \tag{15.4}$$

where $\theta_{ij} := -\ln(1 - P_{ij})$ and

$$P(f_i = 1|\boldsymbol{d}) = 1 - \exp\Big(-\sum_{j=1}^{m} \theta_{ij} d_j\Big). \tag{15.5}$$

Observe that the presence in Eq. (15.1) of terms corresponding to negative findings contributes linearly
to the complexity (a product of exponentials corresponds to a summation of the exponents). However, this
is not the case with the terms associated with positive findings. Take, for example, the extreme case
where all findings are negative. Then

$$P(\boldsymbol{f}|\boldsymbol{d}) = \prod_{i=1}^{n} \exp\Big(-\sum_{j=1}^{m} \theta_{ij} d_j\Big) = \exp\Big(-\sum_{i=1}^{n} \sum_{j=1}^{m} \theta_{ij} d_j\Big). \tag{15.6}$$

Consider now $f_1 = 1$ and the rest to be $f_i = 0$, $i = 2, \ldots, n$. Then

$$P(\boldsymbol{f}|\boldsymbol{d}) = \Big(1 - \exp\Big(-\sum_{j=1}^{m} \theta_{1j} d_j\Big)\Big) \exp\Big(-\sum_{i=2}^{n} \sum_{j=1}^{m} \theta_{ij} d_j\Big)
= \exp\Big(-\sum_{i=2}^{n} \sum_{j=1}^{m} \theta_{ij} d_j\Big) - \exp\Big(-\sum_{i=1}^{n} \sum_{j=1}^{m} \theta_{ij} d_j\Big), \tag{15.7}$$

where the number of exponents to be computed is two. It can easily be shown that the cross-product
terms lead to an exponential computational growth [20] (Problem 15.1).

The path we will follow in order to derive efficient exact inference algorithms as well as to derive
efficient approximation rules, when exact inference is not possible, will be via the use of graphical
models.
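The arithmetic accuracy issue mentioned above is easy to reproduce. The sketch below compares the direct product of Eqs. (15.1) and (15.2) with the log-domain form of Eq. (15.6) for the all-negative-findings case. The sizes `n`, `m` and the randomly drawn table `P` are hypothetical and serve only as a demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 50                            # hypothetical problem size
P = rng.uniform(0.05, 0.5, size=(n, m))    # hypothetical P_ij values
d = rng.integers(0, 2, size=m)             # one disease hypothesis

theta = -np.log(1.0 - P)                   # theta_ij := -ln(1 - P_ij)

# Direct route: Eq. (15.2) per finding, then the product of Eq. (15.1)
direct = np.prod(np.prod((1.0 - P) ** d, axis=1))

# Log-domain route, Eq. (15.6): a single double sum in the exponent
log_prob = -np.sum(theta @ d)

print(direct)     # underflows to 0.0 in double precision for these sizes
print(log_prob)   # finite log-probability
```

Working with `log_prob` keeps the quantity representable, and ratios such as the one in Eq. (15.3) can then be formed by subtracting logarithms before exponentiating.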


15.3 BAYESIAN NETWORKS AND THE MARKOV CONDITION

Before we move on to definitions, let us first see how the existence of some structure in a joint
distribution can simplify the task of marginalization. We will demonstrate it using discrete probabilities,
where the use of counting can make things simpler.

Let us consider $l$ discrete jointly distributed random variables. Applying the product rule of
probability, we obtain

$$P(x_1, x_2, \ldots, x_l) = P(x_l|x_{l-1}, x_{l-2}, \ldots, x_1) P(x_{l-1}|x_{l-2}, \ldots, x_1) \cdots P(x_1). \tag{15.8}$$

Assume that each one of these variables takes values in the discrete set $\{1, 2, \ldots, k\}$. In the general
case, if we want to marginalize with respect to one of the variables, say, $x_1$, we must sum over the
others, that is,

$$P(x_1) = \sum_{x_2} \cdots \sum_{x_l} P(x_1, x_2, \ldots, x_l),$$

where each one of the summations is over $k$ possible values, which is equivalent to $O(k^l)$ summations;
for large values of $k$ and/or $l$, this is a formidable and sometimes impossible task. Let us consider now
one extreme case, where all the involved variables are mutually independent. Then the product rule
becomes

$$P(x_1, x_2, \ldots, x_l) = \prod_{i=1}^{l} P(x_i),$$

and marginalization turns out to be the trivial identity

$$P(x_1) = \Big(\sum_{x_l} P(x_l) \sum_{x_{l-1}} P(x_{l-1}) \cdots \sum_{x_2} P(x_2)\Big) P(x_1), \tag{15.9}$$

because each summation is carried out independently, and of course results in one. In other words,
exploiting the product rule and the statistical independence can bypass the obstacle of the exponential
growth of the computational load. As a matter of fact, the previous full-independence assumption gives
birth to the naive Bayes classifier (Chapter 7).

In this chapter, we are going to study cases that lie between the previous two extremes. The general
idea is to be able to express the joint probability distribution (probability density/mass function) in
terms of products of factors, where each one of them depends on a subset of the involved variables.
This can be expressed by writing the joint distribution as

$$p(x_1, x_2, \ldots, x_l) = \prod_{i=1}^{l} p(x_i|\mathrm{Pa}_i), \tag{15.10}$$


where $\mathrm{Pa}_i$ denotes the subset of variables associated with the random variable $x_i$. Take the following
example:

$$p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_6|x_4) p(x_5|x_3, x_4) p(x_4|x_1, x_2) p(x_3|x_1) p(x_2) p(x_1). \tag{15.11}$$

Then $\mathrm{Pa}_6 = \{x_4\}$, $\mathrm{Pa}_5 = \{x_3, x_4\}$, $\mathrm{Pa}_4 = \{x_1, x_2\}$, $\mathrm{Pa}_3 = \{x_1\}$, $\mathrm{Pa}_2 = \emptyset$, $\mathrm{Pa}_1 = \emptyset$. The variables in the set
$\mathrm{Pa}_i$ are defined as the parents of the respective $x_i$, and from a statistical point of view this means that $x_i$
is statistically independent of the remaining variables that precede it in the product rule of Eq. (15.8),
given the values of its parents. Every $p(x_i|\mathrm{Pa}_i)$ expresses a conditional independence relationship and it
imposes a probabilistic structure that underlies our multivariate set. It is such types of independencies
that we will exploit in order to perform inference tasks at a lower computational cost.
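The saving offered by a factorization such as Eq. (15.11) can be illustrated numerically. The sketch below assumes, for simplicity, that all six variables are binary and fills the conditional probability tables with hypothetical random values; it then checks that the product of the factors is a valid joint distribution and compares the number of free parameters with that of an unstructured joint table.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Parent sets read off Eq. (15.11); binary variables are an assumption here
parents = {1: (), 2: (), 3: (1,), 4: (1, 2), 5: (3, 4), 6: (4,)}

# Hypothetical tables: cpt[i][parent_values] = P(x_i = 1 | Pa_i)
cpt = {i: {pa: rng.uniform(0.1, 0.9)
           for pa in itertools.product([0, 1], repeat=len(ps))}
       for i, ps in parents.items()}

def joint(x):
    """Eq. (15.10): product of P(x_i | Pa_i); x maps index -> 0/1."""
    prob = 1.0
    for i, ps in parents.items():
        p1 = cpt[i][tuple(x[k] for k in ps)]
        prob *= p1 if x[i] == 1 else 1.0 - p1
    return prob

states = [dict(zip(range(1, 7), bits))
          for bits in itertools.product([0, 1], repeat=6)]
total = sum(joint(x) for x in states)          # should be 1 up to rounding

n_factored = sum(2 ** len(ps) for ps in parents.values())  # one number per parent state
n_full = 2 ** 6 - 1                                        # unstructured joint table
print(total, n_factored, n_full)
```

For binary variables the factorized model needs 14 numbers against 63 for the full table, and the gap widens quickly as variables are added while parent sets stay small.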

15.3.1 GRAPHS: BASIC DEFINITIONS 

A graph $G = \{V, E\}$ is a collection of nodes/vertices $V = \{x_1, \ldots, x_l\}$ and a collection of edges (arcs)
$E \subset V \times V$. Each edge connects two vertices and it is denoted as a pair, $(x_i, x_j) \in E$. An edge can
be either directed, in which case we write $(x_i \rightarrow x_j)$ to indicate the direction, or undirected, in which
case we simply write $(x_i, x_j)$. Suppose we have a set of nodes $x_1, x_2, \ldots, x_k$, $k \geq 2$, and a corresponding
set of edges $(x_{i-1}, x_i) \in E$ or $(x_{i-1} \rightarrow x_i) \in E$, $2 \leq i \leq k$; that is, the edges connect pairs of nodes in
sequence and they can be either directed or not. This sequence of edges is called a path from $x_1$ to $x_k$.
If there is at least one directed edge, the path is called directed. A cycle is a path from a node to itself.
A chain or a trail is a path that can be “run” either from $x_1$ to $x_k$ or from $x_k$ to $x_1$; that is, all directed
edges are replaced by undirected ones.

A directed graph comprises directed edges only, and it is called a directed acyclic graph (DAG) if
it contains no cycles. Given a DAG, a node in it, $x_i$, is called parent of $x_j$ if there is a (directed) edge
from $x_i$ to $x_j$, and we call $x_j$ the child of $x_i$. A node $x_j$ is called a descendant of $x_i$ and $x_i$ an ancestor
of $x_j$ if there is a path from $x_i$ to $x_j$. A node $x_j$ is called nondescendant of $x_i$ if it is not a descendant
of $x_i$. A graph is said to be fully connected or complete if there is an edge between every pair of nodes.
Fig. 15.1 illustrates the previous definitions.
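The definitions above translate directly into a few set operations. The following sketch uses the DAG implied by the factorization in Eq. (15.11) as a running example; the dictionary representation and the helper names are illustrative choices, not notation from the text.

```python
# Directed edges of the DAG implied by Eq. (15.11): node -> list of children
children = {
    "x1": ["x3", "x4"],
    "x2": ["x4"],
    "x3": ["x5"],
    "x4": ["x5", "x6"],
    "x5": [],
    "x6": [],
}

def parents(node):
    """Nodes with a directed edge into `node`."""
    return {u for u, vs in children.items() if node in vs}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), list(children[node])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return seen

def nondescendants(node):
    """Every other node that is not a descendant of `node`."""
    return set(children) - descendants(node) - {node}

def is_dag():
    """A directed graph is acyclic iff no node is its own descendant."""
    return all(v not in descendants(v) for v in children)

print(parents("x5"), descendants("x1"), nondescendants("x5"), is_dag())
```

Note that with this definition the parents of a node belong to its nondescendants, which is why the Markov condition conditions on them explicitly.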

Definition 15.1. A Bayesian network structure is a DAG whose nodes represent random variables,
$x_1, \ldots, x_l$, and every variable (node), $x_i$, is conditionally independent of the set of all its
nondescendants, given the set of all its parents. Sometimes this is also known as the Markov condition.

If we denote the set of the nondescendants of a node $x_i$ as $\mathrm{ND}_i$, the Markov condition can be written
as [12] $x_i \perp \mathrm{ND}_i \,|\, \mathrm{Pa}_i$, $\forall i = 1, 2, \ldots, l$. Sometimes, the conditional independencies are also known
as local independencies. Stated differently, a BN graphical structure is a convenient way to encode
conditional independencies. Fig. 15.2 shows the DAG that expresses the conditional independencies
used in Eq. (15.11), in order to express the joint distribution as a product of factors. Conditional
independence among random variables, that is, $x \perp y \,|\, z$, or, equivalently, $p(x|y, z) = p(x|z)$, means that
once we know the value of $z$, observing the value of $y$ gives no additional information about $x$ (note
that the previous makes sense only if $p(y, z) > 0$). For example, the probability of children having a
good education depends on whether they grow up in a poor or a rich (low or high gross national product
[GNP]) country. The probability of someone getting a high-paying job depends on her/his level
of education. The probability of someone getting a high-paying job is independent of the country in
which he or she was born and raised, given the level of her/his education.

FIGURE 15.1

(A) This is a DAG because there are no cycles; $x_1$ is a parent of both $x_2$ and $x_3$; $x_4$ and $x_5$ are children of $x_3$;
$x_1$, $x_2$, and $x_3$ are ancestors of $x_4$, while $x_4$ and $x_5$ are descendants of $x_1$; $x_5$ is a nondescendant of $x_2$ and $x_4$.
(B) This is not a DAG and the sequence $(x_2, x_1, x_3, x_4, x_2)$ comprises a cycle. The edge $(x_3, x_5)$ is undirected.
The sequence of nodes $(x_1, x_3, x_5)$ forms a directed path, and the sequence $(x_1, x_3, x_4, x_3)$ forms a chain,
once directed edges are replaced by undirected ones.

FIGURE 15.2

The BN structure corresponding to the PDF in Eq. (15.11). Observe that $x_5$ is conditionally independent of $x_1$ and
$x_2$ given the values of $x_3$ and $x_4$. Note that nodes in a BN structure correspond to random variables.

Theorem 15.1. Let G be a BN structure and let p be the joint probability distribution of the random
variables associated with the graph. Then p is equal to the product of the conditional distributions of
all the nodes given the values of their parents, and we say that p factorizes over G.

The proof of the theorem is done by induction (Problem 15.2). Moreover, the reverse of this theorem
is also true. The previous theorem assumed a distribution and built the BN based on the underlying
conditional independencies. The next theorem deals with the reverse procedure. One builds a graph
based on a set of conditional distributions, one for each node of the network.

FIGURE 15.3

Bayesian network structure for independent variables. No edges are present because every variable is independent of
all the others and no parents can be identified.

Theorem 15.2. Let G be a DAG and associate a conditional probability for each node, given the
values of its parents. Then the product of these conditional probabilities yields a joint probability of
the variables. Moreover, the Markov condition is satisfied.

The proof of this theorem is given in Problem 15.4. Note that in this theorem, we used the term 
probability and not distribution. The reason is that the theorem is not true for every form of conditional 
densities (PDFs) [14]. However, it holds true for a number of widely used PDFs, such as the Gaussians. 
This theorem is very useful because, often in practice, this is the way we construct a probabilistic
graphical model—building it hierarchically, using reasoning on the corresponding physical process 
that we want to model, and encoding conditional independencies in the graph. 

Fig. 15.3 shows the BN structure describing a set of mutually independent variables (naive Bayes 
assumption). 

Definition 15.2. A Bayesian network (BN) is a pair (G, p), where the distribution p factorizes over
the DAG G, in terms of a set of conditional probability distributions, associated with the nodes of G. 

In other words, a BN is associated with a specific distribution. In contrast, a BN structure refers to 
any distribution that satisfies the Markov condition as expressed by the network structure. 

Example 15.1. Consider the following simplified study relating the GNP of a country to the level of
education and the type of job an adult gets later in her/his professional life. Variable $x_1$ is binary
with two values, HGP and LGP, corresponding to countries with high and low GNP, respectively.
Variable $x_2$ gets three values, NE, LE, and HE, corresponding to no education, low-level, and high-level
education, respectively. Finally, variable $x_3$ also gets three possible values, UN, LP, HP, corresponding
to unemployed, low-paying, and high-paying jobs, respectively. Using a large enough sample of data,
the following probabilities are learned:

1. Marginal probabilities:

$P(x_1 = \mathrm{LGP}) = 0.8$, $P(x_1 = \mathrm{HGP}) = 0.2$.

2. Conditional probabilities:

$P(x_2 = \mathrm{NE}|x_1 = \mathrm{LGP}) = 0.1$, $P(x_2 = \mathrm{LE}|x_1 = \mathrm{LGP}) = 0.7$, $P(x_2 = \mathrm{HE}|x_1 = \mathrm{LGP}) = 0.2$,

$P(x_2 = \mathrm{NE}|x_1 = \mathrm{HGP}) = 0.05$, $P(x_2 = \mathrm{LE}|x_1 = \mathrm{HGP}) = 0.2$, $P(x_2 = \mathrm{HE}|x_1 = \mathrm{HGP}) = 0.75$,

$P(x_3 = \mathrm{UN}|x_2 = \mathrm{NE}) = 0.15$, $P(x_3 = \mathrm{LP}|x_2 = \mathrm{NE}) = 0.8$, $P(x_3 = \mathrm{HP}|x_2 = \mathrm{NE}) = 0.05$,

$P(x_3 = \mathrm{UN}|x_2 = \mathrm{LE}) = 0.10$, $P(x_3 = \mathrm{LP}|x_2 = \mathrm{LE}) = 0.85$, $P(x_3 = \mathrm{HP}|x_2 = \mathrm{LE}) = 0.05$,

$P(x_3 = \mathrm{UN}|x_2 = \mathrm{HE}) = 0.05$, $P(x_3 = \mathrm{LP}|x_2 = \mathrm{HE}) = 0.15$, $P(x_3 = \mathrm{HP}|x_2 = \mathrm{HE}) = 0.8$.


Note that these values are not the result of a specific experiment. They are, however, in line with the
general trend provided by more professional studies, which involve many more random variables;
for pedagogical reasons we keep the example simple.

The first observation is that even for this simplistic example involving only three variables, one has 
to obtain 17 probability values. This verifies the high computational load that may be required for such 
tasks. 

Fig. 15.4 shows the BN that captures the previously stated conditional probabilities. Note that the
Markov condition renders $x_3$ independent of $x_1$, given the value of $x_2$. Indeed, the job that one finds is
independent of the GNP of the country, given her/his education level. We will verify that by playing
with the laws of probability for the previously defined values.

According to Theorem 15.2, the joint probability of an event is given by the product

$$P(x_1, x_2, x_3) = P(x_3|x_2) P(x_2|x_1) P(x_1). \tag{15.12}$$

In other words, the probability of someone coming from a rich country, having a good education, and
getting a high-paying job will be equal to (0.8)(0.75)(0.2) = 0.12; similarly, the probability of somebody
coming from a poor country, having a low-level education, and getting a low-paying job is
(0.85)(0.7)(0.8) = 0.476.

As a next step, we will verify the Markov condition, implied by the BN structure, using the probability
values given before. That is, we will verify that using conditional probabilities to build the
network, these probabilities basically encode conditional independencies, as Theorem 15.2 suggests.
Let us consider

$$P(x_3 = \mathrm{HP}|x_2 = \mathrm{HE}, x_1 = \mathrm{HGP}) = \frac{P(x_3 = \mathrm{HP}, x_2 = \mathrm{HE}, x_1 = \mathrm{HGP})}{P(x_2 = \mathrm{HE}, x_1 = \mathrm{HGP})} = \frac{0.12}{P(x_2 = \mathrm{HE}, x_1 = \mathrm{HGP})}.$$

Also,

$$P(x_2 = \mathrm{HE}, x_1 = \mathrm{HGP}) = P(x_2 = \mathrm{HE}|x_1 = \mathrm{HGP}) P(x_1 = \mathrm{HGP}) = 0.75 \times 0.2 = 0.15,$$

which finally results in

$$P(x_3 = \mathrm{HP}|x_2 = \mathrm{HE}, x_1 = \mathrm{HGP}) = 0.8 = P(x_3 = \mathrm{HP}|x_2 = \mathrm{HE}),$$

which verifies the claim. The reader can check that this is true for all possible combinations of values.

FIGURE 15.4

BN for Example 15.1. Note that $x_3 \perp x_1 \,|\, x_2$.
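The check suggested to the reader can be automated. The sketch below stores the probabilities of Example 15.1 in dictionaries (the variable names `P1`, `P2`, `P3` are illustrative), evaluates the joint of Eq. (15.12), and verifies the Markov condition for every combination of values.

```python
import itertools

# Probabilities of Example 15.1
P1 = {"LGP": 0.8, "HGP": 0.2}                              # P(x1)
P2 = {"LGP": {"NE": 0.10, "LE": 0.70, "HE": 0.20},         # P(x2 | x1)
      "HGP": {"NE": 0.05, "LE": 0.20, "HE": 0.75}}
P3 = {"NE": {"UN": 0.15, "LP": 0.80, "HP": 0.05},          # P(x3 | x2)
      "LE": {"UN": 0.10, "LP": 0.85, "HP": 0.05},
      "HE": {"UN": 0.05, "LP": 0.15, "HP": 0.80}}
JOBS = ("UN", "LP", "HP")

def joint(x1, x2, x3):
    """Eq. (15.12): P(x1, x2, x3) = P(x3 | x2) P(x2 | x1) P(x1)."""
    return P3[x2][x3] * P2[x1][x2] * P1[x1]

print(joint("HGP", "HE", "HP"))   # rich country, high education, high-paying job
print(joint("LGP", "LE", "LP"))   # poor country, low education, low-paying job

# Markov condition: P(x3 | x2, x1) must equal P(x3 | x2) for all values
worst = 0.0
for x1, x2, x3 in itertools.product(P1, P3, JOBS):
    p_x1x2 = sum(joint(x1, x2, v) for v in JOBS)
    worst = max(worst, abs(joint(x1, x2, x3) / p_x1x2 - P3[x2][x3]))
print(worst)                       # largest deviation over all 18 combinations
```

The deviation is zero up to floating-point rounding, as expected from Theorem 15.2, because the joint was built from the conditional tables in the first place.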

15.3.2 SOME HINTS ON CAUSALITY

The existence of directed links in a Bayesian network does not necessarily reflect a cause-effect
relationship from a parent to a child node.² It is a well-known fact in statistics that correlation between two
variables does not always establish a causal relationship between them. For example, their correlation 
may be due to the fact that they both relate to a latent (unknown) variable. A typical example is the 
discussion related to whether smoking causes cancer or they are both due to an unobserved genotype 
that causes cancer and at the same time a craving for nicotine; for many years, this argument was used 
as a defense line of the tobacco companies. 

Let us return to Example 15.1. Although GNP and quality of education are correlated, one cannot
say that GNP is a cause of the educational system. No doubt there is a multiplicity of reasons, such
as the political system, the social structure, the economic system, historical reasons, and tradition, all
of which need to be taken into consideration. As a matter of fact, the structure of the graph relating
the three variables in the example could be reversed. We could collect data the other way around:
obtain the probabilities $P(x_3 = \mathrm{UN})$, $P(x_3 = \mathrm{LP})$, $P(x_3 = \mathrm{HP})$, and then the conditional probabilities
$P(x_2|x_3)$ (e.g., $P(x_2 = \mathrm{HE}|x_3 = \mathrm{UN})$) and finally $P(x_1|x_2)$ (e.g., $P(x_1 = \mathrm{HGP}|x_2 = \mathrm{HE})$). In
principle, such data can also be collected from a sample of people. In such a case, the resulting BN would
again comprise three nodes as in Fig. 15.4, but with the direction of the arrows reversed. This is also
reasonable because the probability of someone coming from a rich or a poor country is independent of
her/his job, given the level of education. Moreover, both models should result in the same joint
probability distribution for any joint event. Thus, if the direction of the arrows were to indicate causality,
then this time, it would be that the educational system has a cause-effect relationship on the GNP.
This, for the same reasons stated before, cannot be justified. Having said all that, it does not necessarily
mean that cause-effect relationships are either absent in a BN or unimportant to know.
On the contrary, in many cases, there is good reason to strive to unveil the underlying cause-effect
relationships while building a BN.
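The claim that the reversed network yields the same joint distribution can be checked numerically. In the sketch below, the probabilities that would be "collected the other way around" are not measured but derived from the joint of Example 15.1, which is an idealization; all variable names are illustrative.

```python
import itertools

X1, X2, X3 = ("LGP", "HGP"), ("NE", "LE", "HE"), ("UN", "LP", "HP")
P1 = {"LGP": 0.8, "HGP": 0.2}
P2 = {"LGP": {"NE": 0.10, "LE": 0.70, "HE": 0.20},
      "HGP": {"NE": 0.05, "LE": 0.20, "HE": 0.75}}
P3 = {"NE": {"UN": 0.15, "LP": 0.80, "HP": 0.05},
      "LE": {"UN": 0.10, "LP": 0.85, "HP": 0.05},
      "HE": {"UN": 0.05, "LP": 0.15, "HP": 0.80}}

# Forward model x1 -> x2 -> x3, Eq. (15.12)
joint = {(a, b, c): P3[b][c] * P2[a][b] * P1[a]
         for a, b, c in itertools.product(X1, X2, X3)}

# Marginals needed by the reversed model x3 -> x2 -> x1
p3 = {c: sum(joint[a, b, c] for a in X1 for b in X2) for c in X3}
p23 = {(b, c): sum(joint[a, b, c] for a in X1) for b in X2 for c in X3}
p12 = {(a, b): sum(joint[a, b, c] for c in X3) for a in X1 for b in X2}
p2 = {b: sum(p12[a, b] for a in X1) for b in X2}

def reversed_joint(a, b, c):
    """P(x3) P(x2 | x3) P(x1 | x2), the factorization with arrows reversed."""
    return p3[c] * (p23[b, c] / p3[c]) * (p12[a, b] / p2[b])

max_gap = max(abs(reversed_joint(*k) - v) for k, v in joint.items())
print(max_gap)   # largest difference between the two factorizations
```

Both factorizations agree up to rounding, so the data alone cannot tell which direction of the arrows, if any, is the causal one.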

Let us elaborate a bit more on this and see why exploiting any underlying cause-effect relationships
can be to our benefit. Take, for example, the BN in Fig. 15.5 relating the presence or absence of a
disease with the findings from two medical tests. Let $x_1$ indicate the presence or absence of a disease
and $x_2$, $x_3$ the discrete outcomes that can result from the two tests.

The BN in Fig. 15.5A complies with our common sense reasoning that $x_1$ (disease) causes $x_2$ and
$x_3$ (tests). However, this is not possible to deduce by simply looking at the available probabilities. This
is because the probability laws are symmetric. Even if $x_1$ is the cause, we can still compute $P(x_1|x_2)$
once $P(x_1, x_2, x_3)$ and $P(x_2)$ are available, that is,

$$P(x_1|x_2) = \frac{P(x_1, x_2)}{P(x_2)} = \frac{\sum_{x_3} P(x_1, x_2, x_3)}{P(x_2)}.$$

FIGURE 15.5

Three possible graphs relating a disease, $x_1$, to the results of two tests, $x_2$, $x_3$. (A) The dependencies in this graph
comply with common sense. (B) This graph renders $x_2$, $x_3$ statistically independent, which is not reasonable.
(C) Training this graph needs an extra probability value compared to that in (A).

² This topic will not be pursued any further; its purpose is to make the reader aware of the issue. It can be bypassed in a first
reading.
Previously, in order to say that $x_1$ causes $x_2$ and $x_3$, we used some extra information/knowledge, which
we called common sense reasoning. Note that in this case, training requires the knowledge of the values
of three probabilities, namely, $P(x_1)$, $P(x_2|x_1)$, and $P(x_3|x_1)$. Let us now assume that we choose the
graph model in Fig. 15.5B. This time, ignoring the cause-effect relationship has resulted in the wrong
model. This model renders $x_2$ and $x_3$ independent, which, obviously, cannot be the case. These should
only be conditionally independent given $x_1$. The only sensible way to keep $x_2$ and $x_3$ as parents of $x_1$ is
to add an extra link, as shown in Fig. 15.5C, which establishes a relation between the two. However, to
train such a network, besides the values of the three probabilities $P(x_2)$, $P(x_3)$, and $P(x_1|x_2, x_3)$, one
needs to know the values for an extra one, $P(x_3|x_2)$. Thus, when building a BN, it is always good to
know any underlying cause-effect directions. Moreover, there are other reasons, too. For example, this
may be related to interventions, which are actions that change the state of a variable in order to study
the respective impact on other variables; because a change propagates in the causal direction, such a
study is only possible if the network has been structured in a cause-effect hierarchy. For example, in
biology, there is a strong interest in understanding which genes affect activation levels of other genes,
and in predicting the effects of turning certain genes on or off.
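Why the model of Fig. 15.5B is wrong can be seen with a few numbers. The sketch below builds the joint distribution of graph (A), with a binary disease and two binary tests, and shows that the two test outcomes are marginally dependent although they are conditionally independent given the disease. All numerical values are hypothetical.

```python
import itertools

# Hypothetical numbers for graph (A): disease x1 -> tests x2, x3 (all binary)
p_x1 = {1: 0.1, 0: 0.9}                      # P(x1)
p_pos = {"x2": {1: 0.9, 0: 0.2},             # P(x2 = 1 | x1)
         "x3": {1: 0.8, 0: 0.1}}             # P(x3 = 1 | x1)

def cond(test, value, x1):
    """P(test = value | x1) for a binary test outcome."""
    p = p_pos[test][x1]
    return p if value == 1 else 1.0 - p

# Joint of graph (A): P(x1) P(x2 | x1) P(x3 | x1)
joint = {(a, b, c): p_x1[a] * cond("x2", b, a) * cond("x3", c, a)
         for a, b, c in itertools.product([0, 1], repeat=3)}

p_x2 = sum(v for (a, b, c), v in joint.items() if b == 1)
p_x3 = sum(v for (a, b, c), v in joint.items() if c == 1)
p_x2x3 = sum(v for (a, b, c), v in joint.items() if b == 1 and c == 1)

print(p_x2x3, p_x2 * p_x3)   # the two values differ: x2 and x3 are dependent
```

A model that forces $P(x_2, x_3) = P(x_2) P(x_3)$, as graph (B) does, cannot represent this distribution; graph (C) can, at the price of the extra table $P(x_3|x_2)$.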

The notion of causality is not an easy one, and philosophers have been arguing about it for centuries. 
Although our intention here is by no means to touch this issue, it is interesting to quote two well-known 
philosophers. 

According to David Hume, causality is not a property of the real world but a concept of the mind 
that helps us explain our perception of the world. Hume (1711-1776) was a Scottish philosopher best 
known for his philosophical empiricism and skepticism. His most well-known work is the “Treatise of 
Human Nature,” and in contrast to the rationalistic philosophy school, he advocated that human nature 
is mainly governed by desire and not reason. 

According to Bertrand Russell, the law of causality has nothing to do with the laws of physics,
which are symmetrical (recall our statement before concerning conditional probabilities) and indicate
no cause-effect relationship. For example, Newton’s gravity law can be expressed in any of the
following forms:

$$B = mg \quad \text{or} \quad g = \frac{B}{m} \quad \text{or} \quad m = \frac{B}{g},$$

and looking only at them, no cause-effect relationship can be deduced. Bertrand Russell (1872-1970)
was a British philosopher, mathematician, and logician. He is considered one of the founders of analytic
philosophy. In Principia Mathematica, coauthored with A.N. Whitehead, an attempt was made to
ground mathematics on mathematical logic. He was also an antiwar activist and a liberal.

The previously stated provocative arguments have been inspired by Judea Pearl’s book [38], and
we provided them in order to persuade the reader to read this book; he or she can only become wiser.
Pearl has made a number of significant contributions to the field and was the recipient of the Turing
award in 2011.

Although one cannot deduce causality by looking only at the laws of physics or probabilities, ways 
of identifying it have been developed. One way is to carry out controlled experiments; one can change 
the values of the variable and study the effect of the change on another. However, this has to be done 
in a controlled way in order to guarantee that the caused effects are not due to other related factors. 

Besides experimentation, there has been a major effort to discover causal relationships from nonex- 
perimental evidence. In modern applications, such as microarray measurements for gene expressions or 
fMRI brain imaging, the number of the involved variables can easily reach the order of a few thousand. 
Performing experiments for such tasks is out of the question. In [38], the notion of causality is related 
to that of the minimality in the structure of the obtained possible DAGs. Such a view ties causality with 
Occam’s razor. More recently, inferring causality was attempted by comparing the conditional distri- 
butions of variables given their direct causes, for all hypothetical causal directions, and choosing the 
most plausible. The method builds upon some smoothness arguments that underlie the conditional dis- 
tributions of the effect given the causes, compared to the marginal distributions of the effect/cause [43]. 
In [24], an interesting alternative for inferring causality is built upon arguments from Kolmogorov’s 
complexity theory; causality is verified by comparing shortest description lengths of strings associated 
with the involved distributions. For further information, the interested reader may consult, for example,
[42] and the references therein. 

15.3.3 d-SEPARATION 

Dependencies and independencies among a set of random variables play a key role in understanding 
their statistical behavior. Moreover, as we have already commented, they can be exploited to substan- 
tially reduce the computational load for solving inference tasks. 

By the definition and the properties of a BN structure, G, we know that certain independencies hold 
and are readily observed via the parent-child links. The question that is now raised is whether there are 
additional independencies that the structure of the graph imposes on any joint probability distribution 
that factorizes over G. Unveiling extra independencies offers the designer more freedom to deal with 
computational complexity issues more aggressively. 

We will attack the task of searching for conditional independencies across a network by observing 
whether probabilistic evidence that becomes available at a node x can propagate and influence our 
certainty about another node, y. 

Serial or head-to-tail connection. This type of node connection is shown in Fig. 15.6A. Evidence
on x will influence the certainty about y, which in turn will influence that of z. This is also true for the
reverse direction, starting from z and propagating to x. However, if the state of y is known, then x and
z become (conditionally) independent. In this case, we say that y blocks the path from x to z, and vice
versa. When the state at a node is fixed/known, we say that the node is instantiated.

FIGURE 15.6

Three different types of connections: (A) serial, (B) diverging, and (C) converging.

FIGURE 15.7

Having some evidence about the weather being either cloudy or rainy establishes the path for information flow
between the nodes “season” and “country.”

Diverging or tail-to-tail connection. In this type of connection, shown in Fig. 15.6B, evidence can
propagate from y to x and from y to z, and also from x to z and from z to x via y, unless y is instantiated. 
In the latter case, y blocks the path from x to z, and vice versa. That is, x and z become independent 
given the value of y. For example, if y represents “flu,” x “runny nose,” and z “sneezing,” then if we 
do not know whether someone has the flu, a runny nose is evidence that can change our certainty 
about her/him having the flu; this in turn changes our belief about sneezing. However, if we know that 
someone has the flu, seeing the nose running gives no extra information about sneezing. 

Converging or head-to-head connection or v-structure. This type of connection is slightly more
subtle than the previous two cases, and it is shown in Fig. 15.6C. Evidence from x does not propagate
to z and thus cannot change our certainty about it. Knowing something about x tells us nothing about
z. For example, let z denote either of two countries (e.g., England and Greece), x “season,” and y
“cloudy weather.” Obviously, knowing the season says nothing about a country. However, having some
evidence about cloudy weather y, knowing that it is summer provides information that can change our
certainty about the country. This is in accordance with our intuition. Knowing that it is summer and that
the weather is cloudy explains away that the country is Greece. This is the reason we sometimes refer
to this type of reasoning as explaining away. Explaining away is an instance of a general reasoning
pattern called intercausal reasoning, where different causes of the same effect can interact; this is a
very common pattern of reasoning in humans.

For this particular type of connection, explaining away is also achieved by evidence that is provided
by any one of the descendants of y. Fig. 15.7 illustrates the case via an example. Having evidence
about the rain will also establish a path so that evidence about the season (country), x (z), changes our
certainty about the country (season), z (x).

To recapitulate, let us stress the delicate point here. For the first two cases, head-to-tail and
tail-to-tail, the path is blocked if node y is instantiated, that is, when its state is disclosed to us. However, in
the head-to-head connection, the path between x and z “opens” when probabilistic evidence becomes
available, either at y or at any one of its descendants.
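The behavior of the converging connection can be reproduced with a toy version of the season/country/cloudy example. All probabilities in the sketch below are invented for illustration; only the structure (season and country are independent causes of cloudy weather) follows the text.

```python
import itertools

# Hypothetical numbers for the v-structure season -> cloudy <- country
p_season = {"summer": 0.5, "winter": 0.5}
p_country = {"Greece": 0.5, "England": 0.5}
p_cloudy = {("summer", "Greece"): 0.05, ("summer", "England"): 0.5,
            ("winter", "Greece"): 0.4, ("winter", "England"): 0.8}

def joint(season, country, cloudy):
    """P(season) P(country) P(cloudy | season, country)."""
    pc = p_cloudy[season, country]
    return p_season[season] * p_country[country] * (pc if cloudy else 1.0 - pc)

def prob_greece(season=None, cloudy=None):
    """P(country = Greece | evidence); None means 'not observed'."""
    num = den = 0.0
    for s, c, y in itertools.product(p_season, p_country, [True, False]):
        if season is not None and s != season:
            continue
        if cloudy is not None and y != cloudy:
            continue
        p = joint(s, c, y)
        den += p
        if c == "Greece":
            num += p
    return num / den

print(prob_greece())                                 # prior belief
print(prob_greece(season="summer"))                  # season alone: no change
print(prob_greece(cloudy=True))                      # cloudy observed
print(prob_greece(season="summer", cloudy=True))     # season now matters
```

Without evidence on the weather, learning the season leaves the belief about the country untouched; once cloudy weather is observed, the season changes it, which is the opening of the path described above.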

Definition 15.3. Let G be a BN structure, and let $x_1, \ldots, x_k$ comprise a chain of nodes. Let Z be a
subset of observed variables. The chain $x_1, \ldots, x_k$ is said to be active given the set Z, if

• whenever a converging connection, $x_{i-1} \rightarrow x_i \leftarrow x_{i+1}$, is present in the chain, then either $x_i$ or one
of its descendants is in Z;

• no other node in the chain is in Z.

In other words, in an active chain, probabilistic evidence can flow from $x_1$ to $x_k$, and vice versa,
because no nodes (links) which can block this information flow are present.

Definition 15.4. Let G be a BN structure and let X, Y, Z be three mutually disjoint sets of nodes in G.
We say that X and Y are d-separated given Z if there is no active chain between any node $x \in X$ and
$y \in Y$ given Z. If these are not d-separated, we say that they are d-connected.

In other words, if two variables x and y are d-separated by a third one, z, then observing the state of
z blocks any evidence propagation from x to y and vice versa. That is, d-separation implies conditional
independence. Moreover, the following very important theorem holds.

Theorem 15.3. Let the pair (G, p) be a BN. For every three mutually disjoint subsets of nodes X, Y, Z,
whenever X and Y are d-separated, given Z, for every pair $(x, y) \in X \times Y$, x and y are conditionally
independent in p given Z.

The proof of the theorem was given in [45]. In other words, this theorem guarantees that 
d-separation implies conditional independence on any probability distribution that factorizes over G. 
Note that, unfortunately, the opposite is not true. There may be conditional independencies that cannot 
be identified by d-separation (e.g., Problem 15.5). However, for most practical applications, the reverse 
is also true. The number of distributions that do not comply with the reverse statement of the theorem 
is infinitesimally small (see, e.g., [25]). Identification of all d-separations in a graph can be carried out 
via a number of efficient algorithms (e.g., [25,32]). 
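As an illustration of how d-separation can be tested mechanically, the sketch below uses the classical moralization criterion: restrict the DAG to the ancestors of the nodes involved, connect ("marry") the parents of every node, drop the directions, remove the observed nodes, and check whether X can still reach Y. This is one standard test and not necessarily the algorithm of [25,32]; the function name and the graph encoding are illustrative, and the example graph is the DAG implied by Eq. (15.11).

```python
from itertools import combinations

def d_separated(parents, X, Y, Z):
    """True if node sets X and Y are d-separated given Z in the DAG.

    parents: dict mapping every node to the set of its parents.
    X, Y, Z are assumed mutually disjoint, as in Definition 15.4.
    """
    X, Y, Z = set(X), set(Y), set(Z)
    # 1. Ancestral set of X, Y, Z (the nodes themselves included)
    keep, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents[v])
    # 2. Moral graph: undirected parent-child links plus links between co-parents
    nbrs = {v: set() for v in keep}
    for v in keep:
        for p in parents[v]:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(parents[v], 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. Search from X without ever stepping onto an observed node
    seen, stack = set(), list(X - Z)
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(n for n in nbrs[v] if n not in Z)
    return not (seen & Y)

# DAG implied by Eq. (15.11)
parents = {"x1": set(), "x2": set(), "x3": {"x1"},
           "x4": {"x1", "x2"}, "x5": {"x3", "x4"}, "x6": {"x4"}}

print(d_separated(parents, {"x5"}, {"x1", "x2"}, {"x3", "x4"}))
print(d_separated(parents, {"x1"}, {"x2"}, set()))
print(d_separated(parents, {"x1"}, {"x2"}, {"x6"}))
```

The first two queries report d-separation. The third does not, because $x_6$ is a descendant of the converging node $x_4$ and observing it activates the chain between $x_1$ and $x_2$.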

Example 15.2. Consider the DAG G of Fig. 15.8, connecting two nodes x, y. It is obvious that these
nodes are not d-separated and comprise an active chain. Consider the following probability distribution,
which factorizes over G:

$P(y = 0|x = 0) = 0.2$, $P(y = 1|x = 0) = 0.8$,

$P(y = 0|x = 1) = 0.2$, $P(y = 1|x = 1) = 0.8$.

It can easily be checked that $P(y|x) = P(y)$ (independent of the values of $P(x = 1)$ and $P(x = 0)$)
and the variables x and y are independent; this cannot be predicted by observing the d-separations.
Note, however, that if we slightly perturb the values of the conditional probabilities, then the resulting
distribution has as many independencies as those predicted by the d-separations, that is, in this case,
none. As a matter of fact, this is a more general result. If we have a distribution which factorizes
over a graph and has independencies that are not predicted by the respective d-separations, a small
perturbation will almost always eliminate them (e.g., [25]).

FIGURE 15.8

This DAG involves no nodes that are d-separated.
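The statement of Example 15.2 is quickly confirmed numerically. In the sketch below, the conditional probabilities are those of the example, while the prior values tried for $P(x = 1)$ and the size of the perturbation are arbitrary choices made for illustration.

```python
def p_y1(px1, cond):
    """Marginal P(y = 1) from P(x = 1) and cond[x] = P(y = 1 | x)."""
    return (1.0 - px1) * cond[0] + px1 * cond[1]

cond = {0: 0.8, 1: 0.8}                  # values of Example 15.2

# P(y = 1 | x) equals P(y = 1) whatever P(x = 1) is
gaps = [abs(p_y1(px1, cond) - cond[1]) for px1 in (0.1, 0.5, 0.9)]
print(max(gaps))                         # zero up to rounding: x, y independent

perturbed = {0: 0.8, 1: 0.81}            # hypothetical small perturbation
gap = abs(p_y1(0.5, perturbed) - perturbed[1])
print(gap)                               # nonzero: the independence is gone
```

The independence in the example is an accident of the particular numbers, not a property of the graph, which is why d-separation cannot reveal it.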

Example 15.3. Consider the DAG shown in Fig. 15.9. The red nodes indicate that the respective random
variables have been observed; that is, these nodes have been instantiated. Node $x_5$ is d-connected
to $x_1$, $x_2$, $x_6$. In contrast, node $x_9$ is d-separated from all the rest. Indeed, evidence starting from $x_1$ is
blocked by $x_3$. However, it propagates via $x_4$ (instantiated and converging connection) to $x_2$, $x_6$, and
then to $x_5$ ($x_7$ is instantiated and a converging connection). In contrast, any flow of evidence toward $x_9$
is blocked by the instantiation of $x_7$. It is interesting to note that although all neighbors of $x_5$ have been
instantiated, it still remains d-connected with other nodes.

Definition 15.5. The Markov blanket of a node is the set of nodes comprising (a) its parents, (b) its 
children, and (c) the nodes sharing a child with this node. Once all the nodes in the blanket of a node 
are instantiated, the node becomes d-separated from the rest of the network (Problem 15.7). 

For example, in Fig. 15.9, the Markov blanket of $x_5$ comprises the nodes $x_3$, $x_4$, $x_8$, $x_7$, $x_6$. Note
that if all these nodes are instantiated, then $x_5$ becomes d-separated from the rest of the nodes.
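Definition 15.5 can be turned into a short routine. Because the edges of Fig. 15.9 cannot be read off the text, the sketch below uses the DAG implied by Eq. (15.11) instead; the function name and the dictionary encoding are illustrative.

```python
def markov_blanket(parents, node):
    """Parents, children, and co-parents of `node` (Definition 15.5).

    parents: dict mapping every node to the set of its parents.
    """
    children = {v for v, ps in parents.items() if node in ps}
    # Nodes sharing a child with `node`, excluding the node itself
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | co_parents

# DAG implied by Eq. (15.11)
parents = {"x1": set(), "x2": set(), "x3": {"x1"},
           "x4": {"x1", "x2"}, "x5": {"x3", "x4"}, "x6": {"x4"}}

print(markov_blanket(parents, "x4"))
print(markov_blanket(parents, "x3"))
```

For $x_4$ the blanket consists of its parents $x_1$, $x_2$, its children $x_5$, $x_6$, and $x_3$, which shares the child $x_5$ with it.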

In the sequel, we give some examples of machine learning tasks, which can be cast in terms of a 
Bayesian graphical representation. As we will discuss, for many practical cases, the involved condi- 
tional probability distributions are expressed in terms of a set of parameters. 


FIGURE 15.9

Red nodes are instantiated. Node $x_5$ is d-connected to $x_1$, $x_2$, $x_6$, and node $x_9$ is d-separated from all the
nonobserved variables.
15.3.4 SIGMOIDAL BAYESIAN NETWORKS

We have already seen that when the involved random variables are discrete, the conditional probabilities
$P(x_i|\mathrm{Pa}_i)$, $i = 1, \ldots, l$, associated with the nodes of a Bayesian graph structure have to be learned from
the training data. If the number of possible states and/or the number of the variables in $\mathrm{Pa}_i$ is large
enough, this amounts to a large number of probabilities that have to be learned; thus, a large number of
training points is required in order to obtain good estimates. This can be alleviated by expressing the
conditional probabilities in a parametric form, that is,

\[ P(x_i|\mathrm{Pa}_i) = P(x_i|\mathrm{Pa}_i; \boldsymbol{\theta}_i), \quad i = 1, 2, \ldots, l. \tag{15.13} \]

In the case of binary-valued variables, a common functional form is to view P as a logistic regression
model; we used this model in the context of relevance vector machines in Chapter 13. Adopting this
model, we have

\[ P(x_i = 1|\mathrm{Pa}_i; \boldsymbol{\theta}_i) = \sigma(t_i) = \frac{1}{1 + \exp(-t_i)}, \tag{15.14} \]

\[ t_i := \theta_{i0} + \sum_{k: x_k \in \mathrm{Pa}_i} \theta_{ik} x_k. \tag{15.15} \]

This reduces the number of parameter vectors to be learned to O(l). The exact number of parameters
depends on the size of the parent sets. Assuming the maximum number of parents for a node to be K,
the unknown number of parameters to be learned from the training data is at most l(K + 1), including
the bias terms $\theta_{i0}$. Taking into account the binary nature of the variables, we can write

\[ P(x_i|\mathrm{Pa}_i; \boldsymbol{\theta}_i) = x_i \sigma(t_i) + (1 - x_i)\big(1 - \sigma(t_i)\big), \tag{15.16} \]

where $t_i$ is given in Eq. (15.15).
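
The following is a minimal sketch of Eqs. (15.14)-(15.16) for a single node. The function names and the convention of storing the bias first in the parameter list are assumptions made for illustration only.

```python
import math


def sigmoid(t):
    """Logistic sigmoid of Eq. (15.14)."""
    return 1.0 / (1.0 + math.exp(-t))


def cond_prob(x_i, parent_values, theta):
    """P(x_i | Pa_i; theta_i) of Eq. (15.16) for a binary x_i in {0, 1}.

    theta = [theta_i0, theta_i1, ...]; the bias comes first (an assumed layout),
    followed by one weight per parent, in the same order as parent_values.
    """
    t = theta[0] + sum(w * x for w, x in zip(theta[1:], parent_values))  # Eq. (15.15)
    s = sigmoid(t)
    return x_i * s + (1 - x_i) * (1 - s)


# Hypothetical node with two binary parents.
theta = [-1.0, 2.0, -1.0]
print(cond_prob(1, [1, 0], theta), cond_prob(0, [1, 0], theta))
```

Only one parameter vector per node is stored, instead of one probability per configuration of the parents.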

Such models are also known as sigmoidal Bayesian networks, and they have been proposed as one 
type of neural network (Chapter 18) (e.g., [33]). Fig. 15.10 presents the graphical structure of such a 
network. The network can be treated as a BN structure by associating a binary variable at each node and 
interpreting nodes’ activations as probabilities, as dictated by Eq. (15.16). Performing inference and 
training of the parameters in such networks is not an easy task. We have to resort to approximations. 
We will come to this in Section 16.3. 



FIGURE 15.10

A sigmoidal Bayesian network, with layers of input, hidden, and output nodes.
15.3.5 LINEAR GAUSSIAN MODELS


The computational advantages of the Gaussian PDF have recurrently been exploited in this book. We
will now see the advantages gained in the framework of graphical models when the conditional PDF at
every node, given the values of its parents, is expressed in a Gaussian form. Let

\[ p(x_i|\mathrm{Pa}_i) = \mathcal{N}\Big(x_i \,\Big|\, \sum_{k: x_k \in \mathrm{Pa}_i} \theta_{ik} x_k + \theta_{i0},\; \sigma_i^2\Big), \tag{15.17} \]

where $\sigma_i^2$ is the respective variance and $\theta_{i0}$ is the bias term. From the properties of a BN, the joint
PDF will be given by the product of the conditional PDFs (Theorem 15.2, which is valid for
Gaussians), and the respective logarithm is given by

\[ \ln p(\boldsymbol{x}) = \sum_{i=1}^{l} \ln p(x_i|\mathrm{Pa}_i) = -\sum_{i=1}^{l} \frac{1}{2\sigma_i^2}\Big(x_i - \sum_{k: x_k \in \mathrm{Pa}_i} \theta_{ik} x_k - \theta_{i0}\Big)^2 + \text{constant}. \tag{15.18} \]

This is of a quadratic form, and hence it is also of a Gaussian nature. The mean values and the
covariance matrices for each one of the variables can be computed recursively in a straightforward way
(Problem 15.8).
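
One possible form of the recursion mentioned above is sketched below, under the assumption that the nodes are numbered in topological order (every parent has a smaller index than its child). The mean of each node follows from the means of its parents, and each covariance follows from covariances already computed. The function name and the list-based representation are illustrative choices, not part of the text.

```python
def linear_gaussian_moments(parents, theta, bias, var):
    """Mean vector and covariance matrix of a linear Gaussian BN.

    parents[i]: indices of the parents of node i (all smaller than i, by assumption)
    theta[i]:   weights theta_ik, in the same order as parents[i]
    bias[i]:    bias term theta_i0
    var[i]:     conditional variance sigma_i^2
    """
    l = len(parents)
    mean = [0.0] * l
    cov = [[0.0] * l for _ in range(l)]
    for j in range(l):
        # E[x_j] = theta_j0 + sum_k theta_jk E[x_k]
        mean[j] = bias[j] + sum(w * mean[k] for k, w in zip(parents[j], theta[j]))
        # cov(x_i, x_j) = sum_k theta_jk cov(x_i, x_k), for every earlier node i
        for i in range(j):
            c = sum(w * cov[i][k] for k, w in zip(parents[j], theta[j]))
            cov[i][j] = cov[j][i] = c
        # var(x_j) = sigma_j^2 + sum_k theta_jk cov(x_j, x_k)
        cov[j][j] = var[j] + sum(w * cov[j][k] for k, w in zip(parents[j], theta[j]))
    return mean, cov


# Hypothetical chain x0 -> x1 -> x2.
mean, cov = linear_gaussian_moments(
    parents=[[], [0], [1]],
    theta=[[], [2.0], [1.0]],
    bias=[1.0, 0.0, 1.0],
    var=[1.0, 1.0, 1.0],
)
print(mean)
print(cov)
```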

Note the computational elegance of such a BN. In order to obtain the joint PDF, one has only to
sum up all the exponents, that is, an operation of linear complexity. Moreover, concerning training,
one could readily think of a way to learn the unknown parameters; adopting the maximum likelihood 
method (although it may not be necessarily the best method), optimization with respect to the unknown 
parameters is a straightforward task. In contrast, one cannot make similar comments for the training of 
the sigmoidal BN. Unfortunately, products of sigmoid functions do not lead to an easy computational 
procedure. In such cases, one has to resort to approximations. For example, one way is to employ the 
variational bound approximation, as discussed in Chapter 13, in order to enforce, locally, a Gaussian 
functional form. We will discuss this technique in Section 16.3.1. 


15.3.6 MULTIPLE-CAUSE NETWORKS 


In the beginning of this chapter, we started with an example from the field of medical informatics. 
We were given a set of diseases and a set of symptoms/findings. The conditional probabilities for each 
symptom being absent, given the presence of a disease (Eq. (15.2)), were assumed known. We can 
consider the diseases as hidden causes (h) and the symptoms as observed variables (y) in a learning 
task. This can be represented in terms of a BN structure as in Fig. 15.11. For the previous medical 
example, the variables h correspond to d (diseases), and the observed variables y to the findings f. 

FIGURE 15.11

The general structure of a multiple-cause Bayesian network. The top-level nodes correspond to the hidden causes
and the bottom ones to the observations.

However, the Bayesian structure given in Fig. 15.11 can serve the needs of a number of inference
and pattern recognition tasks, and sometimes it is referred to as a multiple-cause network, for
obvious reasons. For example, in a machine vision application, the hidden causes, $h_1, h_2, \ldots, h_k$, may
refer to the presence or absence of an object, and $y_n$, $n = 1, \ldots, N$, may correspond to the values
of the observed pixels in an image [15]. The hidden variables can be binary (presence or absence of
the respective object) and the conditional PDF can be formulated into a parameterized form, that is,
$p(y_n|\boldsymbol{h}; \boldsymbol{\theta})$. The specific form of the PDF captures the way objects interact as well as the effects of
the noise. Note that in this case, the BN has a mixed set of variables: the observations are continuous
and the hidden causes are binary. We will return to this type of Bayesian structure when discussing
approximate inference methods in Section 16.3.

15.3.7 I-MAPS, SOUNDNESS, FAITHFULNESS, AND COMPLETENESS

We have seen a number of definitions and theorems referring to the notion of conditional independence 
in graphs and probability distributions. Before we proceed further, it will be instructive to summarize 
what has been said and provide some definitions that will dress up our findings in a more formal 
language. This will prove useful for subsequent generalizations. 

We have seen that a BN is a DAG that encodes a number of conditional independencies. Some of
them are local ones, defined by the parent-child links, and some of them are of a more global nature
and are the result of d-separations. Given a DAG, G, we denote as I(G) the set of all independencies
that correspond to d-separations. Also, let p be a probability distribution over a set of random variables,
$x_1, \ldots, x_l$. We denote as I(p) the set of all independence assertions of the type $x_i \perp x_j | Z$ that hold true
for the distribution p.

Let G be a DAG and p a distribution that factorizes over G; in other words, it satisfies the local 
independencies as suggested by G. Then we have seen (Theorem 15.3) that 

\[ I(G) \subseteq I(p). \tag{15.19} \]

We say that G is an I-map (independence map) for p. This property is sometimes referred to as sound- 
ness. 

Definition 15.6. A distribution p is faithful to a graph G if any independence in p is reflected in the
d-separation properties of the graph.

In other words, the graph can represent all (and only) the conditional independence properties of
the distribution. In such a case we write I(p) = I(G). If equality is valid, we say that the graph G
is a perfect map for p. Unfortunately, this is not valid for every distribution, p, that factorizes over G.
However, for most practical purposes, I(G) = I(p), which is true for almost all distributions that
factorize over G.
Although I(p) = I(G) is not valid for all distributions that factorize over G, the following two
properties are always valid for any BN structure (e.g., [25]):

• If $x \perp y | Z$ for all distributions p that factorize over G, then x and y are d-separated given Z.

• If x and y are d-connected given Z, then there will be some distribution that factorizes over G where
x and y are dependent.

A final definition concerns minimality. 

Definition 15.7. A graph G is said to be a minimal I-map for a set of independencies if the removal of 
any of its edges renders it not to be an I-map. 

Note that a minimal I-map is not necessarily a perfect map. In the same way that there exist
algorithms to find the set of d-separations, there exist algorithms to find perfect and minimal I-maps for a
distribution (e.g., [25]).


15.4 UNDIRECTED GRAPHICAL MODELS 

Bayesian structures and networks are not the only way to encode independencies in distributions. As 
a matter of fact, the directionality assigned to the edges of a DAG, while being advantageous and 
useful in some cases, becomes a disadvantage in others. A typical example is that of four variables,
$x_1, x_2, x_3, x_4$. There is no directed graph that can encode the following conditional independencies
simultaneously: $x_1 \perp x_4 | \{x_2, x_3\}$ and $x_2 \perp x_3 | \{x_1, x_4\}$. Fig. 15.12 shows the possible DAGs; note that
both fail to capture the desired independencies.

In Fig. 15.12A, $x_1 \perp x_4 | \{x_2, x_3\}$ because both paths that connect $x_1$ and $x_4$ are blocked. However,
$x_2$ and $x_3$ are d-connected given $x_1$ and $x_4$ (why?). In Fig. 15.12B, $x_2 \perp x_3 | \{x_1, x_4\}$ because the
diverging links are blocked. However, we have violation of the other independence (why?).

Such situations can be overcome by resorting to undirected graphs. We will also see that this type 
of graphical modeling leads to a simplification concerning our search for conditional independencies. 

FIGURE 15.12

None of these DAGs can capture the two independencies $x_1 \perp x_4 | \{x_2, x_3\}$ and $x_2 \perp x_3 | \{x_1, x_4\}$.

Undirected graphical models or Markov networks or Markov random fields (MRFs) have their roots
in statistical physics. As was the case with the Bayesian models, each node of the graph is associated
with a random variable. Edges connecting nodes are undirected, giving no preference to either of the
two directions. Local interactions among connected nodes are expressed via functions of the involved
variables, but they do not necessarily express probabilities. One can view these local functional
interactions as a way to encode information related to the affinity/similarity among the involved variables.
These local functions are known as potential functions or compatibility functions or factors, and they
are nonnegative, usually positive, functions of their arguments. Moreover, as we will soon see, the
global description of such a model is the result of the product of these local potential functions; this is
in analogy to what holds true for the BNs.

Following a path similar to that used for the directed graphs, we will begin with the factoriza- 
tion properties of a distribution over an MRF and then move on to study conditional independencies. 
Let $x_1, \ldots, x_l$ be a set of random variables that are grouped in K groups, $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_K$; each random
vector $\boldsymbol{x}_k$, $k = 1, 2, \ldots, K$, involves a subset of the random variables, $x_i$, $i = 1, 2, \ldots, l$.

Definition 15.8. A distribution is called a Gibbs distribution if it can be factorized in terms of a set of
potential functions $\psi_1, \ldots, \psi_K$, such that

\[ p(x_1, \ldots, x_l) = \frac{1}{Z} \prod_{k=1}^{K} \psi_k(\boldsymbol{x}_k). \tag{15.20} \]

The constant Z is known as the partition function and it is the normalizing constant to guarantee that
$p(x_1, \ldots, x_l)$ is a probability distribution. Hence,

\[ Z = \int \cdots \int \prod_{k=1}^{K} \psi_k(\boldsymbol{x}_k) \, dx_1 \cdots dx_l, \tag{15.21} \]

which becomes a summation for the case of probabilities.
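
For discrete variables, the summation form of Eq. (15.21) can be evaluated by brute force when the number of variables is small. The sketch below assumes binary variables and two hypothetical pairwise potentials that favor equal values; the names `partition_function` and `gibbs_prob` are illustrative.

```python
from itertools import product


def unnormalized(x, potentials):
    """Product of the potentials in Eq. (15.20), evaluated at configuration x."""
    value = 1.0
    for idx, psi in potentials:
        value *= psi(*(x[i] for i in idx))
    return value


def partition_function(potentials, n_vars, states=(0, 1)):
    """Z of Eq. (15.21), with the integral replaced by a sum over all configurations."""
    return sum(unnormalized(x, potentials) for x in product(states, repeat=n_vars))


def gibbs_prob(x, potentials, Z):
    """Normalized probability of configuration x."""
    return unnormalized(x, potentials) / Z


# Hypothetical potentials on the groups (x0, x1) and (x1, x2).
agree = lambda a, b: 2.0 if a == b else 1.0
potentials = [((0, 1), agree), ((1, 2), agree)]

Z = partition_function(potentials, n_vars=3)
print(Z, gibbs_prob((0, 0, 0), potentials, Z))
```

The cost of this enumeration grows exponentially with the number of variables, which is one reason why efficient inference schemes are needed.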

Note that nobody can prohibit us from assigning conditional probability distributions as potential 
functions and making (15.20) identical to Eq. (15.10); in this case, normalization is not explicitly 
required because each one of the conditional distributions is normalized. However, MRFs can deal 
with more general cases. 

Definition 15.9. We say that a Gibbs distribution p factorizes over an MRF H if each group of the
variables $\boldsymbol{x}_k$, $k = 1, 2, \ldots, K$, involved in the K factors of the distribution p forms a complete subgraph
of H. Every complete subgraph of an MRF is known as a clique, and the corresponding factors of the
Gibbs distribution are known as clique potentials.

Fig. 15.13A shows an MRF and two cliques. Note that the set of nodes $\{x_1, x_3, x_4\}$ does not
comprise a clique because the respective subgraph is not fully connected. The same applies to the set
$\{x_1, x_2, x_3, x_4\}$. In contrast, the sets $\{x_1, x_2, x_3\}$ and $\{x_3, x_4\}$ form cliques. The fact that all variables in
a group $\boldsymbol{x}_k$ that are involved in the respective factor $\psi_k(\boldsymbol{x}_k)$ form a clique means that all these variables
mutually interact, and the factor is a measure of such an interaction/dependence.

A clique is called maximal if we cannot include any other node from the graph in the set without
it ceasing to be a clique. For example, both cliques in Fig. 15.13A are maximal cliques. On the other
hand, the clique in Fig. 15.13B formed by $\{x_1, x_2, x_3\}$ is not maximal because bringing $x_4$ into the new
set $\{x_1, x_2, x_3, x_4\}$ also gives a clique. The same holds true for the clique formed by $\{x_2, x_3, x_4\}$.
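
The two definitions can be checked mechanically on a small graph. In the sketch below, the adjacency sets are an assumption chosen to agree with the description of Fig. 15.13A (edges $x_1$-$x_2$, $x_1$-$x_3$, $x_2$-$x_3$, $x_3$-$x_4$); the function names are illustrative.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """True if every pair of the given nodes is connected by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_maximal_clique(adj, nodes):
    """True if `nodes` is a clique and no further node can be added to it."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    return not any(is_clique(adj, nodes | {v}) for v in adj if v not in nodes)


# Assumed edge set, consistent with the text's description of Fig. 15.13A.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

print(is_clique(adj, {1, 2, 3}), is_clique(adj, {1, 3, 4}))
print(is_maximal_clique(adj, {3, 4}), is_maximal_clique(adj, {1, 2}))
```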
FIGURE 15.13

(A) There are two cliques, encircled by red lines. (B) There are as many possible cliques as the combinations of
the points in pairs, in triples, and so on. Considering all points together also forms a clique, and this is a maximal
clique.

15.4.1 INDEPENDENCIES AND I-MAPS IN MARKOV RANDOM FIELDS

We will now state the equivalent of the d-separation theorem, which was established for BN structures
(recall the respective definition in Section 15.3.3), via the notion of an active chain.

Definition 15.10. Let H be an MRF and let $x_1, x_2, \ldots, x_k$ comprise a path. If Z is a set of observed
variables/nodes, the path is said to be active given Z if none of $x_1, x_2, \ldots, x_k$ is in Z.

Given three disjoint sets, X, Y, Z, we say that the nodes of X are separated from the nodes of Y,
given Z, if there is no active path between X and Y, given Z. Note that the previous definition is much
simpler compared to the respective definition given for the BN structures. According to the current
definition, for a set X to be separated from a set Y given a third set Z, it suffices that all possible
paths from X to Y pass via Z. Fig. 15.14 illustrates the geometry. In Fig. 15.14A, there is no active path
connecting the nodes in X with the nodes in Y given the nodes in Z. In Fig. 15.14B, there exist active paths
connecting X and Y given Z.
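
Because separation only requires that every path from X to Y passes via Z, it can be tested with a plain graph search that is not allowed to enter the nodes of Z. The sketch below uses breadth-first search; the function name `separated` and the small graph are assumptions for illustration.

```python
from collections import deque


def separated(adj, X, Y, Z):
    """True if no active path connects a node of X with a node of Y, given Z.

    adj maps each node to the set of its neighbors; X, Y, Z are disjoint sets.
    """
    Z = set(Z)
    visited = set(X) - Z
    queue = deque(visited)
    while queue:
        u = queue.popleft()
        if u in Y:
            return False                      # an active path reached Y
        for v in adj[u]:
            if v not in visited and v not in Z:
                visited.add(v)                # observed nodes in Z are never entered
                queue.append(v)
    return True


# Hypothetical MRF with edges 1-2, 1-3, 2-3, 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(separated(adj, {1}, {4}, {3}))   # every path from 1 to 4 passes via 3
print(separated(adj, {1}, {4}, {2}))   # the path 1-3-4 avoids 2
```

The search visits each node and edge at most once, in contrast to the case analysis over connection types that d-separation requires.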

Let us now denote by I(H) the set of all possible statements of the type “X separated from Y given
Z.” This is in analogy to the set of all possible d-separations associated with a BN structure. The
following theorem (soundness) holds true (Problem 15.10).

Theorem 15.4. Let p be a Gibbs distribution that factorizes over an MRF H. Then H is an I-map
for p, that is,

\[ I(H) \subseteq I(p). \tag{15.22} \]

FIGURE 15.14

(A) The nodes of X and Y are separated by the nodes of Z. (B) There exist active paths that connect the nodes of X
with the nodes of Y, given Z.

Footnote 3: Because edges are undirected, the notions of “chain” and “path” become identical.

This is the counterpart of Theorem 15.3 in its “I-map formulation” as introduced in Section 15.3.7.
Conversely, one may ask whether p factorizes over H, given that the MRF H is an I-map for the
distribution p. Note that this holds true for BN structures; indeed, if $I(G) \subseteq I(p)$, then p factorizes
over G (Problem 15.12). However, for MRFs, it is only true for strictly positive distributions, as stated
by the following Hammersley-Clifford theorem.

Theorem 15.5. Let H be an MRF over a set of random variables, $x_1, \ldots, x_l$, described by a probability
distribution, p > 0. If H is an I-map for p, then p is a Gibbs distribution that factorizes over H.

For a proof of this theorem, the interested reader is referred to the original paper [22] and also [5].
Our final touch on independencies in the context of MRFs concerns the notion of completeness. As
was the case with the BNs, if p factorizes over an MRF, this does not necessarily establish
completeness, although it is true for almost all practical cases. However, the weaker version holds. That is, if x
and y are two nodes in an MRF, H, which are not separated given a set Z, then there exists a Gibbs
distribution p which factorizes over H and according to which x and y are dependent, given the variables
in Z (see, e.g., [25]).

15.4.2 THE ISING MODEL AND ITS VARIANTS 

The origin of the theory on MRFs is traced back to the discipline of statistical physics, and since 
then it has extensively been used in a number of different disciplines, including machine learning. 
In particular, in image processing and computer vision, MRFs have been established as a major tool 
in tasks such as denoising, image segmentation, and stereo reconstruction (see, e.g., [29]). The goal 
of this section is to state a basic and rather primitive model, which, however, demonstrates the way 
information is captured and subsequently processed by such models. 
FIGURE 15.15

The graph of an MRF with pair-wise dependencies among the nodes. 


Assume that each random variable takes binary values in $\{-1, 1\}$ and that the joint probability
distribution is given by the following model:

\[ p(x_1, \ldots, x_l) := p(\boldsymbol{x}) = \frac{1}{Z} \exp\Big( \sum_i \Big( \sum_{j>i} \theta_{ij} x_i x_j + \theta_{i0} x_i \Big) \Big), \tag{15.23} \]

where $\theta_{ij} = 0$ if the respective nodes are not connected. It is readily seen that this model is the result
of the product of potential functions (factors), each one of an exponential form, defined on cliques
of size two. Also, $\theta_{ij} = \theta_{ji}$, and we sum such that i < j in order to avoid duplication. This model was
originally used by Ising in 1924 in his doctoral thesis to model phase transition phenomena in magnetic
materials. The ±1 of each node in the lattice models the two possible spin directions of the respective
atoms. If $\theta_{ij} > 0$, interacting atoms tend to align spins in the same direction in order to decrease energy
(ferromagnetism). The opposite is true if $\theta_{ij} < 0$. The corresponding graph is given in Fig. 15.15.

This basic model has been exploited in computer vision and image processing for tasks such as
image denoising, image segmentation, and scene analysis. Let us take, as an example, a binarized
image and let $x_i$ denote the noiseless pixel values (±1). Let $y_i$ be the observed noisy pixels whose
values have been corrupted by noise and have changed polarity; see Fig. 15.16 for the respective graph.
The task is to obtain the noiseless pixel values. One can rephrase the model in Eq. (15.23) to the needs
of this task and rewrite it as [5,21]

\[ p(\boldsymbol{x}|\boldsymbol{y}) = \frac{1}{Z} \exp\Big( \alpha \sum_i \sum_{j>i} x_i x_j + \beta \sum_i x_i y_i \Big), \tag{15.24} \]

where we have used only two parameters, $\alpha$ and $\beta$. Moreover, the summation $\sum_i \sum_{j>i}$ involves only
neighboring pixels. The goal now becomes that of estimating the pixel values, $x_i$, by maximizing
the conditional (on the observations) probability. The adopted model is justified by the following two
facts: (a) for low enough noise levels, most of the pixels will have the same polarity as the respective
observations; this is encouraged by the presence of the product $x_i y_i$, where similar signs contribute to
higher probability values; and (b) neighboring pixels are encouraged to have the same polarity because
we know that real-world images tend to be smooth, except at the points that lie close to the edges in
the image. Sometimes, a term $c x_i$, for an appropriately chosen value of c, is also present if we want to
penalize either of the two polarities. The max-product or max-sum algorithms, to be discussed later in
this chapter, are possible algorithmic alternatives for the maximization of the conditional probability given in
Eq. (15.24). However, these are not the only algorithmic possibilities to perform the optimization task.
A number of alternative schemes that deal with inference in MRFs have been developed and studied.
Some of them are suboptimal, yet they enjoy computational efficiency. Some classical references on
the use of MRFs in image processing are [7-9,47].

FIGURE 15.16

A pair-wise MRF as in Fig. 15.15, but now the observed values associated with each node are separately denoted as
red nodes. For the image denoising task, black nodes correspond to the noiseless pixel values (hidden variables) and
red nodes to the observed pixel values.

A number of variants of the basic Ising model result
if one writes it as


\[ p(\boldsymbol{x}) = \frac{1}{Z} \exp\Big( \sum_i \Big( \sum_{j>i} f_{ij}(x_i, x_j) + f_i(x_i) \Big) \Big) \tag{15.25} \]

and uses different functional forms for $f_{ij}(\cdot, \cdot)$ and $f_i(\cdot)$, and also by allowing the variables to take
more than two values. This is sometimes known as the Potts model. In general, MRF models of the
general form of Eq. (15.23) are also known as pair-wise MRFs because the dependence
among nodes is expressed in terms of products of pairs of variables. Further information on the
applications of MRFs in image processing can be found in, for example, [29,40].
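
To illustrate the denoising model of Eq. (15.24), the sketch below uses iterated conditional modes (ICM), a simple coordinate-wise scheme that is not described in this section and is chosen here only as an example of a suboptimal yet cheap alternative. Each pixel is set, in turn, to the polarity that maximizes the exponent of Eq. (15.24) with all other pixels held fixed. The 4-neighborhood, the parameter values, and the function name are assumptions.

```python
def icm_denoise(y, alpha=1.0, beta=1.0, sweeps=5):
    """Coordinate-wise maximization of the exponent of Eq. (15.24).

    y is a list of rows with entries in {-1, +1}; a 4-neighborhood is assumed.
    """
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]                 # initialize with the observations
    for _ in range(sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                s = sum(x[rr][cc] for rr, cc in neighbors
                        if 0 <= rr < rows and 0 <= cc < cols)
                # Terms of the exponent that involve x_i: x_i * (alpha * s + beta * y_i)
                field = alpha * s + beta * y[r][c]
                new_value = 1 if field >= 0 else -1
                if new_value != x[r][c]:
                    x[r][c] = new_value
                    changed = True
        if not changed:                       # a local maximum has been reached
            break
    return x


# Hypothetical 3x3 image of +1 pixels whose center has been flipped by noise.
noisy = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]
print(icm_denoise(noisy))
```

ICM only guarantees a local maximum of the conditional probability, so the result depends on the initialization and on the order in which the pixels are visited.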
Another name for Eq. (15.23) is Boltzmann distribution, where, usually, the variables take values in
{0, 1}. Such a distribution has been used in Boltzmann machines [23]. Boltzmann machines can be seen
as the stochastic counterpart of Hopfield networks; the latter have been proposed to act as associative
memories as well as a way to attack combinatorial optimization problems (e.g., [31]). The interest in
Boltzmann machines has been revived in the context of deep learning, and we will discuss them in 
more detail in Chapter 18. 


15.4.3 CONDITIONAL RANDOM FIELDS (CRFS) 

All the graphical models (directed and undirected) that have been discussed so far revolve around the
joint distribution of the involved random variables and its factorization on a corresponding graph. More
recently, there is a trend to focus on the conditional distribution of some of the variables given the rest.
The focus on the joint PDF originates from our interest in developing generative learning models.
However, this may not always be the most efficient way to deal with learning tasks, and we have already
discussed the discriminative learning alternative in Chapters 3 and 7. Let us assume that from
the set of the jointly distributed variables, some correspond to output target variables, whose values
are to be inferred when the rest are observed. For example, the target variables may correspond to the
labels in a classification task and the rest to the (input) features.

Let us denote the former set by the vector $\boldsymbol{y}$ and the latter by $\boldsymbol{x}$. Instead of focusing on the joint
distribution $p(\boldsymbol{x}, \boldsymbol{y})$, it may be more sensible to focus on $p(\boldsymbol{y}|\boldsymbol{x})$. In [27], graphical models were adopted
to encode the conditional distribution, $p(\boldsymbol{y}|\boldsymbol{x})$.

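A conditional random field (CRF) is an undirected graph H whose nodes correspond to the
joint set of random variables $(\boldsymbol{x}, \boldsymbol{y})$, but we now assume that the conditional distribution is factorized,
that is,

\[ p(\boldsymbol{y}|\boldsymbol{x}) = \frac{1}{Z(\boldsymbol{x})} \prod_{k=1}^{K} \psi_k(\boldsymbol{x}_k, \boldsymbol{y}_k), \tag{15.26} \]

where $\{\boldsymbol{x}_k, \boldsymbol{y}_k\} \subseteq \{\boldsymbol{x}, \boldsymbol{y}\}$, $k = 1, 2, \ldots, K$, and

\[ Z(\boldsymbol{x}) = \int \prod_{k=1}^{K} \psi_k(\boldsymbol{x}_k, \boldsymbol{y}_k) \, d\boldsymbol{y}, \tag{15.27} \]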
where for discrete distributions the integral becomes summation. To stress the difference with
Eq. (15.20), note that there it is the joint distribution of all the involved variables that is factorized.
As a result, and observing Eqs. (15.20) and (15.26), it turns out that the normalization constant is
now a function of $\boldsymbol{x}$. This seemingly minor difference can offer a number of advantages in practice.
Avoiding the explicit modeling of $p(\boldsymbol{x})$, we have the benefit of using as inputs variables with complex
dependencies, because we do not care to model them. This has led to CRFs being applied in a number
of areas, such as text mining, bioinformatics, and computer vision. Although we are not going
to get involved with CRFs from now on, it suffices to say that the efficient inference techniques,
which will be discussed in subsequent sections, can also be adapted, with only minor modifications, to
the case of CRFs. For a tutorial on CRFs, including a number of variants and techniques concerning
inference and learning, the interested reader is referred to [44].
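
The dependence of the normalization constant on $\boldsymbol{x}$ can be seen in a toy example. The sketch below assumes a chain of binary labels $y_n \in \{-1, +1\}$ with hypothetical exponential factors, one tying each label to its input and one tying neighboring labels, and evaluates the summation form of Eq. (15.27) by enumeration. The factor forms and weights are illustrative assumptions only.

```python
import math
from itertools import product


def crf_conditional(x, w_obs=2.0, w_pair=1.0):
    """Return (p(y|x) for every labeling y, Z(x)) for a small chain-structured CRF.

    Assumed factors: exp(w_obs * x_n * y_n) and exp(w_pair * y_n * y_{n+1}).
    """
    scores = {}
    for y in product((-1, 1), repeat=len(x)):
        s = sum(w_obs * xn * yn for xn, yn in zip(x, y))
        s += sum(w_pair * a * b for a, b in zip(y, y[1:]))
        scores[y] = math.exp(s)               # product of all the factors
    Z_x = sum(scores.values())                # the sum runs over y only; x stays fixed
    return {y: v / Z_x for y, v in scores.items()}, Z_x


p1, Z1 = crf_conditional((1.0, 1.0, -1.0))
p2, Z2 = crf_conditional((0.0, 0.0, 0.0))
print(max(p1, key=p1.get), Z1, Z2)
```

No distribution over the inputs appears anywhere in the sketch; a different input simply gives a different normalizer.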
FIGURE 15.17

The Gibbs distribution can be written as a product of factors involving the cliques $(x_1, x_2)$, $(x_1, x_3)$, $(x_3, x_4)$,
$(x_3, x_2)$, $(x_1, x_4)$ or of $(x_1, x_2, x_3)$, $(x_1, x_3, x_4)$.


15.5 FACTOR GRAPHS 

In contrast to a BN, an MRF does not necessarily indicate the specific form of factorization of the
corresponding Gibbs distribution. Looking at a BN, the factorization evolves along the conditional
distributions allocated in each node. Let us look at the MRF of Fig. 15.17. The corresponding Gibbs
distribution could be written as

\[ p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \psi_1(x_1, x_2) \psi_2(x_1, x_3) \psi_3(x_3, x_2) \psi_4(x_3, x_4) \psi_5(x_1, x_4), \tag{15.28} \]

or

\[ p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \psi_1(x_1, x_2, x_3) \psi_2(x_1, x_3, x_4). \tag{15.29} \]

As an extreme case, if all the points of an MRF form a maximal clique, as is the case in Fig. 15.13B, we
could include only a single product term. Note that aiming at maximal cliques reduces the number of
factors, but at the same time the complexity is increased; for example, this can amount to an exponential
explosion in the number of terms that have to be learned in the case of discrete variables. At the same
time, using large cliques hides modeling details. On the other hand, smaller cliques allow us to be more
explicit and detailed in our description.

Factor graphs provide us with the means of making the decomposition of a probability distribu¬ 
tion into a product of factors more explicit. A factor graph is an undirected bipartite graph involving 
two types of nodes (thus the term bipartite): one that corresponds to the random variables, denoted 
by circles, and one that corresponds to the potential functions, denoted by squares. Edges exist only 
between two different types of nodes, that is, between “potential function” nodes and “variable” nodes 
[15,16,26]. 

Fig. 15.18A is an MRF for four variables. The respective factor graph in Fig. 15.18B corresponds
to the product

\[ p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \psi_{c_1}(x_1, x_2, x_3) \psi_{c_2}(x_3, x_4), \tag{15.30} \]

and the one in Fig. 15.18C to

\[ p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \psi_{c_1}(x_1) \psi_{c_2}(x_1, x_2) \psi_{c_3}(x_1, x_2, x_3) \psi_{c_4}(x_3, x_4). \tag{15.31} \]

As an example, if the potential functions were chosen to express “interactions” among variables using
probabilistic information, the involved functions in Fig. 15.18B may be chosen as

\[ \psi_{c_1}(x_1, x_2, x_3) = p(x_3|x_1, x_2) p(x_2|x_1) p(x_1) \tag{15.32} \]

and

\[ \psi_{c_2}(x_3, x_4) = p(x_4|x_3). \tag{15.33} \]

For the case of Fig. 15.18C,

\[ \psi_{c_1}(x_1) = p(x_1), \quad \psi_{c_2}(x_1, x_2) = p(x_2|x_1), \]

\[ \psi_{c_3}(x_1, x_2, x_3) = p(x_3|x_1, x_2), \quad \psi_{c_4}(x_3, x_4) = p(x_4|x_3). \]

For such an example, in both cases, it is readily seen that Z = 1. We will soon see that factor graphs
turn out to be very useful for inference computations.
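
The claim Z = 1 can be verified numerically. In the sketch below, each factor graph is stored as a list of (variable indices, factor function) pairs, which mirrors the bipartite structure of factor nodes and variable nodes. The conditional probability values are hypothetical and serve only as an illustration.

```python
from itertools import product


# Hypothetical conditional probabilities for binary x1, ..., x4.
def p_x1(x1):
    return 0.7 if x1 == 1 else 0.3


def p_x2_given_x1(x2, x1):
    return 0.9 if x2 == x1 else 0.1


def p_x3_given_x1x2(x3, x1, x2):
    q = 0.2 + 0.3 * x1 + 0.4 * x2             # P(x3 = 1 | x1, x2)
    return q if x3 == 1 else 1.0 - q


def p_x4_given_x3(x4, x3):
    q = 0.6 if x3 == 1 else 0.25              # P(x4 = 1 | x3)
    return q if x4 == 1 else 1.0 - q


# Two factors, as in Eqs. (15.32) and (15.33); variables are indexed 0..3.
factors_b = [
    ((0, 1, 2), lambda x1, x2, x3: p_x3_given_x1x2(x3, x1, x2)
                                   * p_x2_given_x1(x2, x1) * p_x1(x1)),
    ((2, 3), lambda x3, x4: p_x4_given_x3(x4, x3)),
]
# Four factors, as in the finer factorization of Eq. (15.31).
factors_c = [
    ((0,), p_x1),
    ((0, 1), lambda x1, x2: p_x2_given_x1(x2, x1)),
    ((0, 1, 2), lambda x1, x2, x3: p_x3_given_x1x2(x3, x1, x2)),
    ((2, 3), lambda x3, x4: p_x4_given_x3(x4, x3)),
]


def unnormalized(x, factors):
    """Product of all factors evaluated at configuration x."""
    value = 1.0
    for idx, psi in factors:
        value *= psi(*(x[i] for i in idx))
    return value


def partition(factors, n_vars=4):
    """Sum of the factor product over all binary configurations."""
    return sum(unnormalized(x, factors) for x in product((0, 1), repeat=n_vars))


Z_b, Z_c = partition(factors_b), partition(factors_c)
print(Z_b, Z_c)
```

Both factor graphs describe the same distribution; they differ only in how finely the product is split into factors.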

Remarks 15.1. 


• A variant of the factor graphs, known as normal factor graphs (NFG), has been more recently 
introduced. In an NFG, edges represent variables and vertices represent factors. Moreover, latent 
and observable variables (internal and external) are distinguished by being represented by edges of 
degree 2 and degree 1, respectively. Such models can lead to simplified learning algorithms and can 
nicely unify a number of previously proposed models (see, e.g., [2,3,18,19,30,34,35]). 



FIGURE 15.18 


(A) An MRF and (B, C) possible equivalent factor graphs at different levels of granularity of the
factorization in terms of product factors.
15.5.1 GRAPHICAL MODELS FOR ERROR CORRECTING CODES

Graphical models are extensively used for representing a class of error correcting codes. In the block 
parity-check codes (e.g., [31]), one sends k information bits (0, 1 for a binary code) in a block of N
bits, N > k; thus, redundancy is introduced into the system to cope with the effects of noise in the
transmission channel. The extra bits are known as parity-check bits. For each code, a parity-check
matrix, H, is defined; in order to be a valid one, each code word, x, must satisfy the parity-check
constraint (modulo-2 operations), Hx = 0. Take as an example the case of k = 3 and N = 6. The code
comprises $2^3$ ($2^k$ in general) code words, each of them of length N = 6 bits. For the parity-check
matrix,

\[ H = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}, \]

the eight code words that satisfy the parity-check constraint are 000000, 001011, 010101, 011110,
100110, 101101, 110011, and 111000. In each one of the eight words, the first three bits are the
information bits and the remaining ones the parity-check bits, which are uniquely determined in order to
satisfy the parity-check constraint. Each one of the three parity-check constraints can be expressed via
a function, that is,

\[ \psi_1(x_1, x_2, x_4) = \delta(x_1 \oplus x_2 \oplus x_4), \]
\[ \psi_2(x_1, x_3, x_5) = \delta(x_1 \oplus x_3 \oplus x_5), \]
\[ \psi_3(x_2, x_3, x_6) = \delta(x_2 \oplus x_3 \oplus x_6), \]

where $\delta(\cdot)$ is equal to one if its argument is zero (the constraint is satisfied) and equal to zero if its
argument is one, and $\oplus$ denotes the modulo-2 addition. The code words are transmitted over a noisy memoryless binary
symmetric channel, where each transmitted bit, $x_i$, may be flipped over and be received as $y_i$, according
to the following rule:

\[ P(y = 0|x = 1) = p, \quad P(y = 1|x = 1) = 1 - p, \]

\[ P(y = 1|x = 0) = p, \quad P(y = 0|x = 0) = 1 - p. \]

Upon reception of the observation sequence, $y_i$, $i = 1, 2, \ldots, N$, one has to decide the value $x_i$ that
was transmitted. Because the channel has been assumed memoryless, every bit is affected by the noise
independently of the other bits, and the overall posterior probability of each code word is proportional
to

\[ \prod_{i=1}^{N} P(y_i|x_i). \]

In order to guarantee that only valid code words are considered, and assuming equiprobable
information bits, we write the joint probability as

\[ P(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{Z} \psi_1(x_1, x_2, x_4) \psi_2(x_1, x_3, x_5) \psi_3(x_2, x_3, x_6) \prod_{i=1}^{N} P(y_i|x_i), \]

where the parity-check constraints have been taken into account. The respective factor model is shown
in Fig. 15.19, where

\[ g_i(y_i, x_i) = P(y_i|x_i). \]

FIGURE 15.19

Factor graph for a (3,3) parity-check code.

The task of decoding is to derive an efficient inference scheme to compute the posteriors, and based on
that to decide in favor of 1 or 0.
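
For a code this small, the code words and the most probable transmitted word can be found by exhaustive search, which makes the role of the factors concrete. The sketch below is a brute-force illustration and not the efficient inference scheme referred to above; the function names and the flip probability p = 0.1 are assumptions.

```python
from itertools import product

# Parity-check matrix of the example.
H = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]


def is_codeword(x):
    """True if Hx = 0 (modulo 2), that is, if all three parity factors equal one."""
    return all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)


codewords = [x for x in product((0, 1), repeat=6) if is_codeword(x)]


def decode(y, p=0.1):
    """Code word maximizing prod_i P(y_i | x_i) for a binary symmetric channel."""
    def likelihood(x):
        flips = sum(a != b for a, b in zip(x, y))
        return p ** flips * (1 - p) ** (len(y) - flips)
    return max(codewords, key=likelihood)


print(["".join(map(str, c)) for c in codewords])
print(decode((0, 0, 1, 0, 1, 0)))   # 001011 received with its last bit flipped
```

Exhaustive search visits all $2^k$ code words, which is feasible only for very short codes.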


15.6 MORALIZATION OF DIRECTED GRAPHS

At a number of points we have already made bridges between BNs and MRFs. In this section, we will 
formalize the bridge and see how one can convert a BN to an MRF and discuss the subsequent effects 
of such a conversion on the implied conditional independencies. 

We can trust common sense to guide us in constructing such a conversion. Because the conditional
distributions will play the role of the potential functions (factors), one has to make sure that edges
exist among all the variables involved in each one of these factors. Because edges from the parents to
the children already exist, we have to (a) retain these edges and make them undirected and (b) add edges
between nodes that are parents of a common child. This is shown in Fig. 15.20. In Fig. 15.20A, a DAG is
shown, which is converted to the MRF of Fig. 15.20B by adding undirected edges between $x_1, x_2$
(parents of $x_3$) and $x_3, x_6$ (parents of $x_5$). The procedure is known as moralization and the resulting
undirected graph as a moral graph. The terminology stems from the fact that “parents are forced to
be married.” This conversion will be very useful soon, when an inference algorithm is stated that
covers both BNs and MRFs in a unifying framework.
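
The two steps (a) and (b) can be sketched in a few lines of code. This is a minimal illustration, assuming the DAG is given as a dictionary that maps each child to the list of its parents; the function name `moralize` and this representation are choices made here, not notation from the text.

```python
from itertools import combinations


def moralize(parents):
    """Return the undirected edge set of the moral graph.

    parents: dict mapping each child node to a list of its parent nodes
             (an assumed representation of the DAG).
    Each edge is a frozenset {u, v}, so direction is dropped.
    """
    edges = set()
    for child, pa in parents.items():
        # (a) keep every parent-child edge, now undirected
        for p in pa:
            edges.add(frozenset((p, child)))
        # (b) "marry" all pairs of parents of a common child
        for u, v in combinations(pa, 2):
            edges.add(frozenset((u, v)))
    return edges
```

For the converging connection $x_1 \rightarrow x_3 \leftarrow x_2$, `moralize({"x3": ["x1", "x2"]})` returns the two original edges plus the added edge between `x1` and `x2`.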

The obvious question that is now raised is how moralization affects independencies. It turns out
that if $H$ is the resulting moral graph, then $I(H) \subseteq I(G)$ (Problem 15.11). In other words, the moral

FIGURE 15.20 

(A) A DAG and (B) the resulting MRF after applying moralization on the DAG. Directed edges become undirected 
and new edges, shown in red, are added to “marry” parents with common child nodes. 


graph can guarantee a smaller number of independencies compared to the original BN via its set of
d-separations. This is natural because one adds extra links. For example, in Fig. 15.20A, $x_1$ and $x_2$ in the
converging connection $x_1 \rightarrow x_3 \leftarrow x_2$ are marginally independent when $x_3$ is not given. However, in the
resulting moral graph in Fig. 15.20B, this independence is lost. It can be shown that moralization adds the
fewest extra links and hence retains the maximum number of independencies (see, for example, [25]).


15.7 EXACT INFERENCE METHODS: MESSAGE PASSING ALGORITHMS 

This section deals with efficient techniques for inference on undirected graphical models. So, even if 
our starting point was a BN, we assume that it was converted to an undirected one prior to the inference 
task. We will begin with the simplest case of graphical models—graphs comprising a chain of nodes. 
This will help the reader to grasp the basic notions behind exact inference schemes. It is interesting to 
note that, in general, the inference task in graphical models is an NP-hard one [10]. Moreover, it has 
been shown that for general BNs, approximate inference to a desired number of digits of precision is 
also an NP-hard task [11]; that is, the time required has an exponential dependence on the number of 
digits of accuracy. However, as we will see in a number of cases that are commonly encountered in 
practice, the exponential growth in computational time can be bypassed by exploiting the underlying 
independencies and factorization properties of the associated distributions. 

The inference tasks of interest are (a) computing the likelihood, (b) computing the marginals of 
the involved variables, (c) computing conditional probability distributions, and (d) finding modes of 
distributions. 

15.7.1 EXACT INFERENCE IN CHAINS 

Let us consider the chain graph of Fig. 15.21 and focus our interest on computing marginals. The naive 
approach, which overlooks factorization and independencies, would be to work directly on the joint 
distribution. Let us concentrate on discrete variables and assume that each one of them, $l$ in total, has
$K$ states. Then, in order to compute the marginal of, say, $x_j$, we have to obtain the sum


FIGURE 15.21

An undirected chain graph with $l$ nodes. There are $l - 1$ cliques consisting of pairs of nodes.


$$P(x_j) := \sum_{x_1} \cdots \sum_{x_{j-1}} \sum_{x_{j+1}} \cdots \sum_{x_l} P(x_1, \ldots, x_l)
= \sum_{x_i : i \neq j} P(x_1, \ldots, x_l). \tag{15.34}$$

Each summation is over $K$ values; hence, the number of required computations amounts to $O(K^l)$.
Let us now bring factorization into the game and concentrate on computing $P(x_1)$. Assume that the
joint probability factorizes over the graph. Hence, we can write

$$P(\boldsymbol{x}) := P(x_1, x_2, \ldots, x_l) = \frac{1}{Z} \prod_{i=1}^{l-1} \psi_{i,i+1}(x_i, x_{i+1}), \tag{15.35}$$


and

$$P(x_1) = \frac{1}{Z} \sum_{x_i : i \neq 1} \prod_{i=1}^{l-1} \psi_{i,i+1}(x_i, x_{i+1}). \tag{15.36}$$

Note that the only term that depends on $x_l$ is $\psi_{l-1,l}(x_{l-1}, x_l)$. Let us start by summing with respect
to this last variable, which leaves unaffected all the preceding factors in the sequence of products in
Eq. (15.36), i.e.,


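$$P(x_1) = \frac{1}{Z} \sum_{x_i : i \neq 1, l} \prod_{i=1}^{l-2} \psi_{i,i+1}(x_i, x_{i+1}) \sum_{x_l} \psi_{l-1,l}(x_{l-1}, x_l), \tag{15.37}$$

where we exploited the basic property of arithmetic

$$\sum_i \alpha \beta_i = \alpha \sum_i \beta_i. \tag{15.38}$$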


• Define

$$\mu_b(x_{l-1}) := \sum_{x_l} \psi_{l-1,l}(x_{l-1}, x_l).$$

Because the possible values of the pair $(x_{l-1}, x_l)$ comprise a table with $K^2$ elements, the summation
involves $K^2$ terms and $\mu_b(x_{l-1})$ consists of $K$ possible values.

• After marginalizing out $x_l$, the only factor in the product that depends on $x_{l-1}$ is

$$\psi_{l-2,l-1}(x_{l-2}, x_{l-1}) \mu_b(x_{l-1}).$$

Then, in a similar way as before, we obtain

$$P(x_1) = \frac{1}{Z} \sum_{x_i : i \neq 1, l-1, l} \prod_{i=1}^{l-3} \psi_{i,i+1}(x_i, x_{i+1}) \sum_{x_{l-1}} \psi_{l-2,l-1}(x_{l-2}, x_{l-1}) \mu_b(x_{l-1}),$$

where this summation also involves $K^2$ terms.


FIGURE 15.22

To compute $P(x_1)$, starting from the last node, $x_l$, each node (a) receives a message and (b) processes it locally via
sum and product operations, which produces a new message; and (c) the latter is passed backward, to the node on its
left. We have assumed that $\mu_b(x_l) = 1$.

We are now ready to define the general recursion as

$$\mu_b(x_i) := \sum_{x_{i+1}} \psi_{i,i+1}(x_i, x_{i+1}) \mu_b(x_{i+1}), \quad i = 1, 2, \ldots, l-1, \qquad \mu_b(x_l) = 1, \tag{15.39}$$

whose repeated application leads to

$$\mu_b(x_1) = \sum_{x_2} \psi_{1,2}(x_1, x_2) \mu_b(x_2),$$

and finally

$$P(x_1) = \frac{1}{Z} \mu_b(x_1). \tag{15.40}$$

The series of recursions is illustrated in Fig. 15.22.

We can think that every node, $x_i$, (a) receives a message from its right, $\mu_b(x_i)$, which for our case
comprises $K$ values, (b) performs locally sum-multiply operations and a new message, $\mu_b(x_{i-1})$, is
computed, which (c) is passed to its left, to node $x_{i-1}$. The subscript “b” in $\mu_b$ denotes “backward” to
remind us of the flow of the message passing activity from right to left.
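
The backward recursion of Eq. (15.39) can be sketched as follows. This is a minimal illustration, assuming the pairwise potentials are stored as nested lists with `psis[i][a][b]` standing for $\psi_{i+1,i+2}(x_{i+1} = a, x_{i+2} = b)$ and states numbered from 0; the function names are hypothetical.

```python
def backward_messages(psis, K):
    """Backward messages mu_b for a chain, following Eq. (15.39).

    psis[i][a][b] holds psi_{i+1,i+2}(x_{i+1}=a, x_{i+2}=b) for the 0-based
    list index i, so a chain with l nodes has len(psis) == l - 1 tables.
    """
    l = len(psis) + 1
    mu_b = [None] * l
    mu_b[l - 1] = [1.0] * K                      # mu_b(x_l) = 1
    for i in range(l - 2, -1, -1):               # pass messages right to left
        mu_b[i] = [
            sum(psis[i][a][b] * mu_b[i + 1][b] for b in range(K))
            for a in range(K)
        ]
    return mu_b


def marginal_of_first_node(psis, K):
    """P(x_1) = mu_b(x_1) / Z, Eq. (15.40); Z is the sum of mu_b(x_1) over x_1."""
    mu = backward_messages(psis, K)[0]
    Z = sum(mu)
    return [m / Z for m in mu]
```

Each message update touches a $K \times K$ table once, so the whole pass costs on the order of $K^2 l$ operations rather than $K^l$.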

If we wanted to compute $P(x_l)$, we would adopt the same reasoning but start the summation from $x_1$.
In this case, message passing takes place forward (from left to right) and messages are defined as

$$\mu_f(x_{i+1}) := \sum_{x_i} \psi_{i,i+1}(x_i, x_{i+1}) \mu_f(x_i), \quad i = 1, \ldots, l-1, \qquad \mu_f(x_1) = 1, \tag{15.41}$$

where “f” has been used to denote “forward” flow. The procedure is shown in Fig. 15.23.


FIGURE 15.23

To compute $P(x_l)$, message passing takes place in the forward direction, from left to right. As opposed to
Fig. 15.22, messages are denoted as $\mu_f$, to remind us of the forward flow.

The term $\mu_b(x_j)$ is the result of summing the products over $x_{j+1}, x_{j+2}, \ldots, x_l$, and the term $\mu_f(x_j)$
over $x_1, x_2, \ldots, x_{j-1}$. At each iteration step, one variable is eliminated by summing over all its
possible values. It can be easily shown, following similar arguments, that the marginal at any point, $x_j$,
$2 \leq j \leq l - 1$, is obtained by (Problem 15.13)

$$P(x_j) = \frac{1}{Z} \mu_f(x_j) \mu_b(x_j), \quad j = 2, 3, \ldots, l-1. \tag{15.42}$$

The idea is to perform one forward and one backward message passing operation, store the values,
and then compute any one of the marginals of interest. The total cost will be $O(2K^2 l)$, instead of $O(K^l)$
for the naive approach.

We still have to compute the normalizing constant $Z$. This is readily done by summing up both
sides of Eq. (15.42), which requires $O(K)$ operations,

$$Z = \sum_{x_j = 1}^{K} \mu_f(x_j) \mu_b(x_j). \tag{15.43}$$
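
One forward and one backward pass, followed by the products of Eq. (15.42) and the normalization of Eq. (15.43), can be sketched as below. This is an illustrative sketch only; it assumes the same nested-list layout as before (`psis[i][a][b]` for the potential between nodes `i` and `i + 1`, 0-based, states numbered from 0), and the function name `chain_marginals` is hypothetical. With the conventions $\mu_f(x_1) = 1$ and $\mu_b(x_l) = 1$, the same product also covers the two end nodes.

```python
def chain_marginals(psis, K):
    """All single-node marginals of a chain via forward-backward messages.

    Returns (marginals, Z) where marginals[j][a] = P(x_{j+1} = a).
    """
    l = len(psis) + 1
    mu_f = [[1.0] * K for _ in range(l)]         # mu_f(x_1) = 1
    mu_b = [[1.0] * K for _ in range(l)]         # mu_b(x_l) = 1
    for i in range(1, l):                        # forward pass, Eq. (15.41)
        mu_f[i] = [
            sum(psis[i - 1][a][b] * mu_f[i - 1][a] for a in range(K))
            for b in range(K)
        ]
    for i in range(l - 2, -1, -1):               # backward pass, Eq. (15.39)
        mu_b[i] = [
            sum(psis[i][a][b] * mu_b[i + 1][b] for b in range(K))
            for a in range(K)
        ]
    # Eq. (15.43): any node gives the same Z; the first node is used here.
    Z = sum(mu_f[0][a] * mu_b[0][a] for a in range(K))
    marginals = [
        [mu_f[j][a] * mu_b[j][a] / Z for a in range(K)]   # Eq. (15.42)
        for j in range(l)
    ]
    return marginals, Z
```

Storing both message arrays is the design choice that lets every marginal be read off afterward at a cost of $K$ multiplications per node.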

So far, we have considered the computation of marginal probabilities. Let us now turn our attention
to their conditional counterparts. We start with the simplest case, for example, to compute
$P(x_j | x_k = \hat{x}_k)$, $k \neq j$. That is, we assume that variable $x_k$ has been observed and its value is $\hat{x}_k$.
The first step in computing the conditional is to recover the joint $P(x_j, x_k = \hat{x}_k)$. This is an
unnormalized version of the respective conditional, which can then be obtained as

$$P(x_j | x_k = \hat{x}_k) = \frac{P(x_j, x_k = \hat{x}_k)}{P(\hat{x}_k)}. \tag{15.44}$$

The only difference in computing $P(x_j, x_k = \hat{x}_k)$ compared to the previous computations of the
marginals is that now, in order to obtain the messages, we do not sum with respect to $x_k$. We just
clamp the respective potential function to the value $\hat{x}_k$. That is, the computations

$$\mu_b(x_{k-1}) = \sum_{x_k} \psi_{k-1,k}(x_{k-1}, x_k) \mu_b(x_k),$$

$$\mu_f(x_{k+1}) = \sum_{x_k} \psi_{k,k+1}(x_k, x_{k+1}) \mu_f(x_k)$$

are replaced by

$$\mu_b(x_{k-1}) = \psi_{k-1,k}(x_{k-1}, \hat{x}_k) \mu_b(\hat{x}_k),$$

$$\mu_f(x_{k+1}) = \psi_{k,k+1}(\hat{x}_k, x_{k+1}) \mu_f(\hat{x}_k).$$

In other words, $x_k$ is considered a delta function at the instantiated value. Once $P(x_j, x_k = \hat{x}_k)$ has been
obtained, normalization is straightforward and is locally performed at the $j$th node. The procedure can
be generalized when more than one variable is observed.
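
Clamping an observed variable amounts to replacing the sum over its values by the single observed term. The sketch below illustrates this under the same assumed nested-list layout for the potentials (0-based node indices, states numbered from 0); the name `chain_conditional` and the `evidence` dictionary are hypothetical, and the queried node `j` is assumed not to be observed.

```python
def chain_conditional(psis, K, j, evidence):
    """P(x_j | evidence) on a chain, with observed nodes clamped.

    psis[i][a][b]: potential between nodes i and i+1 (0-based).
    evidence: dict {node index: observed state}; j must not be a key.
    """
    l = len(psis) + 1

    def allowed(i):
        # An observed node contributes a single term instead of a sum.
        return [evidence[i]] if i in evidence else range(K)

    mu_f = [1.0] * K
    for i in range(0, j):                        # forward messages up to node j
        mu_f = [
            sum(psis[i][a][b] * mu_f[a] for a in allowed(i))
            for b in range(K)
        ]
    mu_b = [1.0] * K
    for i in range(l - 2, j - 1, -1):            # backward messages down to node j
        mu_b = [
            sum(psis[i][a][b] * mu_b[b] for b in allowed(i + 1))
            for a in range(K)
        ]
    joint = [mu_f[a] * mu_b[a] for a in range(K)]    # unnormalized joint with the evidence
    s = sum(joint)                               # local normalization at node j
    return [v / s for v in joint]
```

Because the normalization is done locally at node `j`, the constant $Z$ never needs to be computed explicitly.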

15.7.2 EXACT INFERENCE IN TREES 

Having gained experience and learned the basic secrets in developing efficient inference algorithms for
chains, we turn our attention to the more general case involving tree-structured undirected graphical
models.

A tree is a graph in which there is a single path between any two nodes of the graph; thus, there are 
no cycles in the graph. Figs. 15.24A and B are two examples of trees. Note that in a directed tree, any 
node has only a single parent. A tree can be directed or undirected. Furthermore, because in a directed 
one there are no children with two parents, the moralization step, which converts a directed graph to 
an undirected one, adds no extra links. The only change consists of making the edges undirected. 

There is an important property of trees, which will prove very useful for our current needs.
Let us denote a tree graph as $T$, which is a collection of vertices/nodes $V$ and edges $E$, which link the
nodes, that is, $T = \{V, E\}$. Consider any node $x \in V$ and consider the set of all its neighbors, that is,
all nodes that share an edge with $x$. Denote this set as

$$\mathcal{N}(x) = \{y \in V : (x, y) \in E\}.$$

FIGURE 15.24

Examples of (A) directed and (B) undirected trees. Note that in the directed one, any node has a single parent. In 
both cases, there is only a single chain that connects any two nodes. (C) The graph is not a tree because there is a 
cycle. 


Looking at Fig. 15.24B, we have $\mathcal{N}(x) = \{y, u, z, v\}$. Then, for each element $r \in \mathcal{N}(x)$, define the
subgraph $T_r = \{V_r, E_r\}$ such that any node in this subgraph can be reached from $r$ via paths that do
not pass through $x$. In Fig. 15.24B, the respective subgraphs, each associated with one element in
$\{y, u, z, v\}$, are encircled by dotted lines. By the definition of a tree, it can easily be deduced that each
one of these subgraphs is also a tree. Moreover, these subgraphs are disjoint. In other words, each one
of the neighboring nodes of a node, for example, $x$, can be viewed as the root of a subtree, and these
subtrees are mutually disjoint, having no common nodes. This property will allow us to break a large
problem into a number of smaller ones. Moreover, each one of the smaller problems can be further
divided in the same way, being itself a tree. We now have all the basic ingredients to derive an efficient
scheme for inference on trees (recall that such a breaking of a large problem into a sequence of smaller
ones was at the heart of the message passing algorithm for chains). However, let us first bring the notion
of factor graphs into the scene. The reason is that using factor graphs allows us to deal with some more
general graph structures, such as polytrees.

A directed polytree is a graph in which, although there are no cycles, a child may have more than
one parent. Fig. 15.25A shows an example of a polytree. An unpleasant situation arises after the
moralization step because marrying the parents results in cycles, and we cannot derive exact inference
algorithms in graphs with cycles (Fig. 15.25B). However, if one converts the original directed
polytree into a factor graph, the resulting bipartite entity has a tree structure, with no cycles involved
(Fig. 15.25C). Thus, everything we said before about tree structures applies to these factor graphs.

15.7.3 THE SUM-PRODUCT ALGORITHM 

We will develop the algorithm in a “bottom-up” approach, via the use of an example. Once the rationale 
is understood, the generalization can readily be obtained. Let us consider the factor tree of Fig. 15.26.

FIGURE 15.25

(A) Although there are no cycles, one of the nodes has two parents, and hence the graph is a polytree. (B) The 
resulting structure after moralization has a cycle. (C) A factor graph for the polytree in (A). 


FIGURE 15.26

The tree is subdivided into three subtrees, each one having as its root one of the factor nodes connected to $x_1$. This
is the node whose marginal is computed in the text. Messages are initiated from the leaf nodes toward $x_1$. Once
messages arrive at $x_1$, a new propagation of messages starts, this time from $x_1$ to the leaves.


The factor nodes are denoted by capital letters and squares, and each one is associated with a potential
function. The rest are variable nodes, denoted by circles. Assume that we want to compute the marginal
$P(x_1)$. Node $x_1$, being a variable node, is connected to factor nodes only. We split the graph into as many
(tree) subgraphs as there are factor nodes connected to $x_1$ (three in our case). In the figure, each one of these
subgraphs is encircled, having as roots the nodes $A$, $H$, and $G$, respectively. Recall that the joint $P(\boldsymbol{x})$
is given as the product of all the potential functions, each one associated with one factor node, divided
by the normalizing constant, $Z$. Focusing on the node of interest, $x_1$, this product can be written as

$$P(\boldsymbol{x}) = \frac{1}{Z} \psi_A(x_1, \boldsymbol{x}_A) \psi_H(x_1, \boldsymbol{x}_H) \psi_G(x_1, \boldsymbol{x}_G), \tag{15.45}$$

where $\boldsymbol{x}_A$ denotes the vector corresponding to all the variables in $T_A$; the vectors $\boldsymbol{x}_H$ and $\boldsymbol{x}_G$ are
similarly defined. The function $\psi_A(x_1, \boldsymbol{x}_A)$ is the product of all the potential functions associated with
the factor nodes in $T_A$, and $\psi_H(x_1, \boldsymbol{x}_H)$ and $\psi_G(x_1, \boldsymbol{x}_G)$ are defined in an analogous way. Then the
marginal of interest is given by

$$P(x_1) = \frac{1}{Z} \sum_{\boldsymbol{x}_A \in V_A} \sum_{\boldsymbol{x}_H \in V_H} \sum_{\boldsymbol{x}_G \in V_G} \psi_A(x_1, \boldsymbol{x}_A) \psi_H(x_1, \boldsymbol{x}_H) \psi_G(x_1, \boldsymbol{x}_G). \tag{15.46}$$

We will concentrate on the subtree with root $A$, denoted as $T_A := \{V_A, E_A\}$, where $V_A$ stands for the
nodes in $T_A$ and $E_A$ for the respective set of edges. Because the three subtrees are disjoint, we can split
the previous expression in Eq. (15.46) into

$$P(x_1) = \frac{1}{Z} \sum_{\boldsymbol{x}_A \in V_A} \psi_A(x_1, \boldsymbol{x}_A) \sum_{\boldsymbol{x}_H \in V_H} \psi_H(x_1, \boldsymbol{x}_H) \sum_{\boldsymbol{x}_G \in V_G} \psi_G(x_1, \boldsymbol{x}_G). \tag{15.47}$$

Note that $x_1 \notin V_A \cup V_H \cup V_G$. Having reserved the symbol $\psi_A(\cdot, \cdot)$ to denote the product of all the
potentials in the subtree $T_A$ (and similarly for $T_H$ and $T_G$), let us denote the individual potential functions,
for each one of the factor nodes, via the symbol $f$, as shown in Fig. 15.26. Thus, we can now write

$$\begin{aligned}
\sum_{\boldsymbol{x}_A \in V_A} \psi_A(x_1, \boldsymbol{x}_A) = {} & \sum_{\boldsymbol{x}_A \in V_A} f_a(x_1, x_2, x_3, x_4) f_c(x_3) f_b(x_2, x_5, x_6) f_d(x_4, x_7, x_8) f_e(x_4, x_9, x_{10}) \\
= {} & \sum_{x_2} \sum_{x_3} \sum_{x_4} f_a(x_1, x_2, x_3, x_4) f_c(x_3) \sum_{x_7} \sum_{x_8} f_d(x_4, x_7, x_8) \\
& \times \sum_{x_9} \sum_{x_{10}} f_e(x_4, x_9, x_{10}) \sum_{x_6} \sum_{x_5} f_b(x_2, x_5, x_6).
\end{aligned} \tag{15.48}$$

Recall from our treatment of the chain graph that messages were nothing but locally computed summations
over products. Having this experience, let us define

$$\mu_{f_b \to x_2}(x_2) = \sum_{x_6} \sum_{x_5} f_b(x_2, x_5, x_6),$$

$$\mu_{f_e \to x_4}(x_4) = \sum_{x_9} \sum_{x_{10}} f_e(x_4, x_9, x_{10}),$$

$$\mu_{f_d \to x_4}(x_4) = \sum_{x_7} \sum_{x_8} f_d(x_4, x_7, x_8),$$

$$\mu_{f_c \to x_3}(x_3) = f_c(x_3),$$

$$\mu_{x_4 \to f_a}(x_4) = \mu_{f_d \to x_4}(x_4) \mu_{f_e \to x_4}(x_4),$$

$$\mu_{x_2 \to f_a}(x_2) = \mu_{f_b \to x_2}(x_2),$$

$$\mu_{x_3 \to f_a}(x_3) = \mu_{f_c \to x_3}(x_3),$$

and

$$\mu_{f_a \to x_1}(x_1) = \sum_{x_2} \sum_{x_3} \sum_{x_4} f_a(x_1, x_2, x_3, x_4) \mu_{x_2 \to f_a}(x_2) \mu_{x_3 \to f_a}(x_3) \mu_{x_4 \to f_a}(x_4).$$

Observe that we were led to define two types of messages; one type is passed from variable nodes to 
factor nodes and the other type is passed from factor nodes to variable nodes. 

• Variable node to factor node messages (Fig. 15.27A):

$$\mu_{x \to f}(x) = \prod_{s : f_s \in \mathcal{N}(x) \setminus f} \mu_{f_s \to x}(x). \tag{15.49}$$

We use $\mathcal{N}(x)$ to denote the set of the nodes with which a variable node $x$ is connected; $\mathcal{N}(x) \setminus f$
refers to all these nodes excluding the factor node $f$; note that all these nodes are factor nodes. In other
words, the action of a variable node, as far as message passing is concerned, is to multiply incoming
messages. Obviously, if it is connected to only one factor node (besides $f$), then such a variable
node passes what it receives, without any computation. This is, for example, the case with $\mu_{x_2 \to f_a}$,
as previously defined.

• Factor node to variable node messages (Fig. 15.27B):

$$\mu_{f \to x}(x) = \sum_{x_i \in \mathcal{N}(f) \setminus x} f(\boldsymbol{x}^f) \prod_{i : x_i \in \mathcal{N}(f) \setminus x} \mu_{x_i \to f}(x_i), \tag{15.50}$$



FIGURE 15.27 

(A) Variable $x$ is connected to $S$ factor nodes, besides $f$; that is, $\mathcal{N}(x) \setminus f = \{f_1, f_2, \ldots, f_S\}$. The output
message from $x$ to $f$ is the product of the incoming messages. The arrows indicate directions of flow of the message
propagation. (B) The factor node $f$ is connected to $S$ node variables, besides $x$; that is, $\mathcal{N}(f) \setminus x = \{x_1, x_2, \ldots, x_S\}$.










where $\mathcal{N}(f)$ denotes the set of the (variable) nodes connected to $f$ and $\mathcal{N}(f) \setminus x$ the corresponding
set if we exclude node $x$. The vector $\boldsymbol{x}^f$ comprises all the variables involved as arguments in $f$, that
is, all the variables/nodes in $\mathcal{N}(f)$.

If a node is a leaf, we adopt the following convention. If it is a variable node, $x$, connected to a factor
node, $f$, then

$$\mu_{x \to f}(x) = 1. \tag{15.51}$$

If it is a factor node, $f$, connected to a variable node, $x$, then

$$\mu_{f \to x}(x) = f(x). \tag{15.52}$$

Adopting the previously stated definitions, Eq. (15.48) is now written as

$$\begin{aligned}
\sum_{\boldsymbol{x}_A \in V_A} \psi_A(x_1, \boldsymbol{x}_A) &= \sum_{x_2} \sum_{x_3} \sum_{x_4} f_a(x_1, x_2, x_3, x_4) \mu_{x_2 \to f_a}(x_2) \mu_{x_3 \to f_a}(x_3) \mu_{x_4 \to f_a}(x_4) \\
&= \mu_{f_a \to x_1}(x_1).
\end{aligned} \tag{15.53}$$

Working similarly for the other two subtrees, $T_G$ and $T_H$, we finally obtain

$$P(x_1) = \frac{1}{Z} \mu_{f_a \to x_1}(x_1) \mu_{f_g \to x_1}(x_1) \mu_{f_h \to x_1}(x_1). \tag{15.54}$$

Note that each summation can be viewed as a step that “removes” a variable and produces a message.
This is the reason that sometimes this procedure is called variable elimination. We are now ready to
summarize the steps of the algorithm.

The algorithmic steps

1. Pick the variable $x$ whose marginal, $P(x)$, will be computed.

2. Divide the tree into as many subtrees as the number of factor nodes to which the variable node $x$ is
connected.

3. For each one of these subtrees, identify the leaf nodes.

4. Start message passing, toward $x$, by initializing the leaf nodes according to Eqs. (15.51) and (15.52)
and then utilizing Eqs. (15.49) and (15.50).

5. Compute the marginal according to Eq. (15.54), or in general

$$P(x) = \frac{1}{Z} \prod_{s : f_s \in \mathcal{N}(x)} \mu_{f_s \to x}(x). \tag{15.55}$$

The normalizing constant, $Z$, can be obtained by adding both sides of Eq. (15.55) over all possible
values of $x$. As was the case with the chain graphs, if a variable is observed, then we replace the
summation over this variable, in the places where this is required, by a single term evaluated on the
observed value.

Although so far we have considered discrete variables, everything that has been said also applies to
continuous variables by substituting summations by integrals. For such integrations, Gaussian models
turn out to be very convenient.
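
For discrete variables, the algorithmic steps above can be sketched recursively: a message is computed on demand from the messages that feed into it, which traverses each subtree from its leaves toward the chosen node. This is a minimal illustration, not an optimized implementation; it assumes the factor graph is a tree, that every variable has $K$ states numbered from 0, and that each factor is given as a pair (tuple of variable names, function). The names `sum_product_marginal`, `var_to_factor`, and `factor_to_var` are hypothetical.

```python
from itertools import product


def sum_product_marginal(factors, K, target):
    """Marginal P(target) on a tree-structured factor graph.

    factors: list of (vars, f) pairs; vars is a tuple of variable names and
             f a function returning a nonnegative number for given states.
    """
    def var_to_factor(v, f_idx):
        # Eq. (15.49): product of messages from all other factors attached to v.
        # A leaf variable has no other factors, so it sends 1 (Eq. (15.51)).
        msg = [1.0] * K
        for g_idx, (gvars, _) in enumerate(factors):
            if g_idx != f_idx and v in gvars:
                m = factor_to_var(g_idx, v)
                msg = [a * b for a, b in zip(msg, m)]
        return msg

    def factor_to_var(f_idx, v):
        # Eq. (15.50): sum over the other arguments of f, weighted by incoming
        # messages. A leaf factor has no other arguments and sends f(x) (Eq. (15.52)).
        fvars, f = factors[f_idx]
        others = [u for u in fvars if u != v]
        incoming = {u: var_to_factor(u, f_idx) for u in others}
        msg = [0.0] * K
        for xv in range(K):
            for vals in product(range(K), repeat=len(others)):
                assign = dict(zip(others, vals))
                assign[v] = xv
                term = f(*(assign[u] for u in fvars))
                for u, val in zip(others, vals):
                    term *= incoming[u][val]
                msg[xv] += term
        return msg

    # Eq. (15.55): product of all incoming factor messages, then normalize.
    unnorm = var_to_factor(target, None)
    Z = sum(unnorm)
    return [p / Z for p in unnorm]
```

Calling the function once per variable repeats shared work; the two-stage message passing discussed in the remarks that follow avoids this by computing every message only once.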

Remarks 15.2. 

• Thus far, we have concentrated on the computation of the marginal for a single variable, $x$. If the
marginal of another variable is required, the obvious way would be for the whole process to be
repeated. However, as we already commented in Section 15.7.1, such an approach is computationally
wasteful because many of the computations are common and can be shared for the evaluation
of the marginals of the various nodes. Note that in order to compute the marginal at any variable
node, one needs all the messages from the factor nodes to which the specific variable node is
connected to be available (Eq. (15.55)). Assume now that we pick one node, say, $x_1$, and compute the
marginal once all the required messages have “arrived.” Then this node initiates a new message
propagation phase, this time toward the leaves. It is not difficult to see (Problem 15.15) that this
process, once completed, will make available to every node all the required messages for the
computation of the respective marginals. In other words, this two-stage message passing procedure, in
two opposite-flow directions, suffices to provide all the necessary information for the computation
of the marginals at every node. The total number of messages passed is just twice the number of
edges in the graph. Similar to the case of chain graphs, in order to compute conditional
probabilities, say, $P(x_j | x_k = \hat{x}_k)$, node $x_k$ has to be instantiated. Running the sum-product algorithm will
provide the joint probability $P(x_j, x_k = \hat{x}_k)$, from which the respective conditional is obtained after
normalization; this is performed locally at the respective variable node.

• The (joint) marginal probability of all the variables $x_1, x_2, \ldots, x_S$, associated with a factor node $f$,
is given by (Problem 15.16)

$$P(x_1, \ldots, x_S) = \frac{1}{Z} f(x_1, \ldots, x_S) \prod_{s=1}^{S} \mu_{x_s \to f}(x_s). \tag{15.56}$$

• Earlier versions of the sum-product algorithm, known as belief propagation, were independently 
developed in the context of singly connected graphs in [28,36,37]. However, the problem of variable 
elimination has an older history and has been discovered in different communities (e.g., [4,6,39]). 
Sometimes, the general sum-product algorithm, as described before, is also called the generalized 
forward-backward algorithm. 

15.7.4 THE MAX-PRODUCT AND MAX-SUM ALGORITHMS 

Let us now turn our attention from marginals to modes of distributions. That is, given a distribution
$P(\boldsymbol{x})$ that factorizes over a tree (factor) graph, the task is to compute efficiently the quantity

$$\max_{\boldsymbol{x}} P(\boldsymbol{x}).$$

We will focus on discrete variables. Following similar arguments as before, one can readily write the
counterpart of Eq. (15.46), for the case of Fig. 15.26, as

$$\max_{\boldsymbol{x}} P(\boldsymbol{x}) = \frac{1}{Z} \max_{x_1} \max_{\boldsymbol{x}_A \in V_A} \max_{\boldsymbol{x}_H \in V_H} \max_{\boldsymbol{x}_G \in V_G} \psi_A(x_1, \boldsymbol{x}_A) \psi_H(x_1, \boldsymbol{x}_H) \psi_G(x_1, \boldsymbol{x}_G). \tag{15.57}$$


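Exploiting the property of the max operator, that is,

$$\max_{b,c}(ab, ac) = a \max_{b,c}(b, c), \quad a \geq 0,$$

we can rewrite Eq. (15.57) as

$$\max_{\boldsymbol{x}} P(\boldsymbol{x}) = \frac{1}{Z} \max_{x_1} \max_{\boldsymbol{x}_A \in V_A} \psi_A(x_1, \boldsymbol{x}_A) \max_{\boldsymbol{x}_H \in V_H} \psi_H(x_1, \boldsymbol{x}_H) \max_{\boldsymbol{x}_G \in V_G} \psi_G(x_1, \boldsymbol{x}_G).$$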


Following similar arguments as for the sum-product rule, we arrive at the counterpart of Eq. (15.48),
that is,

$$\begin{aligned}
\max_{x_1} \max_{\boldsymbol{x}_A \in V_A} \psi_A(x_1, \boldsymbol{x}_A) = {} & \max_{x_1} \max_{x_2, x_3, x_4} f_a(x_1, x_2, x_3, x_4) f_c(x_3) \max_{x_7, x_8} f_d(x_4, x_7, x_8) \\
& \times \max_{x_9, x_{10}} f_e(x_4, x_9, x_{10}) \max_{x_5, x_6} f_b(x_2, x_5, x_6).
\end{aligned} \tag{15.58}$$

Eq. (15.58) suggests that everything that was said before for the sum-product message passing algorithm
holds true here, provided we replace summations with max operations, and the definitions of
the messages passed between nodes change to

$$\mu_{x \to f}(x) = \prod_{s : f_s \in \mathcal{N}(x) \setminus f} \mu_{f_s \to x}(x), \tag{15.59}$$

and

$$\mu_{f \to x}(x) = \max_{x_i : x_i \in \mathcal{N}(f) \setminus x} f(\boldsymbol{x}^f) \prod_{i : x_i \in \mathcal{N}(f) \setminus x} \mu_{x_i \to f}(x_i), \tag{15.60}$$


with the same definition of symbols as for Eq. (15.50). Then the mode of $P(\boldsymbol{x})$ is given by

$$\max_{\boldsymbol{x}} P(\boldsymbol{x}) = \frac{1}{Z} \max_{x_1} \mu_{f_a \to x_1}(x_1) \mu_{f_g \to x_1}(x_1) \mu_{f_h \to x_1}(x_1), \tag{15.61}$$

or in general

$$\max_{\boldsymbol{x}} P(\boldsymbol{x}) = \frac{1}{Z} \max_{x} \prod_{s : f_s \in \mathcal{N}(x)} \mu_{f_s \to x}(x), \tag{15.62}$$

where $x$ is the node chosen to play the role of the root, toward which the flow of the messages is
directed, starting from the leaves. The resulting scheme is known as the max-product algorithm.

In practice, an alternative formulation of the previously stated max-product algorithm is usually
adopted. Often, the involved potential functions are probabilities (by absorbing the normalization
constant) and their magnitude is less than one; if a large number of product terms is involved,
the product becomes extremely small, which may lead to arithmetic inaccuracies. A way to bypass this
is to involve the logarithmic function, which transforms products into summations. This is justified by
the fact that the logarithmic function is monotonically increasing; hence it does not affect the point
$\boldsymbol{x}$ at which a maximum occurs, that is,

$$\boldsymbol{x}^* := \arg\max_{\boldsymbol{x}} P(\boldsymbol{x}) = \arg\max_{\boldsymbol{x}} \ln P(\boldsymbol{x}). \tag{15.63}$$
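
The numerical motivation can be illustrated with a small experiment. The sketch below is only an illustration in double-precision floating point; the function names and the probability values used are hypothetical. Two long products of probabilities both underflow to zero and can no longer be compared, whereas their logarithms remain well separated.

```python
import math


def product_score(probs):
    """Direct product of probabilities; may underflow to 0.0 for long lists."""
    score = 1.0
    for p in probs:
        score *= p
    return score


def log_score(probs):
    """Sum of log-probabilities; has the same maximizer as the product."""
    return sum(math.log(p) for p in probs)


a = [0.1] * 400      # true product is 1e-400, below the double-precision range
b = [0.05] * 400     # an even smaller product
```

Here `product_score(a)` and `product_score(b)` are indistinguishable, while `log_score(a) > log_score(b)` still identifies the larger of the two.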

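Under this formulation, the following max-sum version of the algorithm results. It is straightforward to
see that Eqs. (15.59) and (15.60) now take the form of

$$\mu_{x \to f}(x) = \sum_{s : f_s \in \mathcal{N}(x) \setminus f} \mu_{f_s \to x}(x), \tag{15.64}$$

$$\mu_{f \to x}(x) = \max_{x_i : x_i \in \mathcal{N}(f) \setminus x} \left\{ \ln f(\boldsymbol{x}^f) + \sum_{i : x_i \in \mathcal{N}(f) \setminus x} \mu_{x_i \to f}(x_i) \right\}. \tag{15.65}$$

In place of Eqs. (15.51) and (15.52) for the initial messages, sent by the leaf nodes, we now define

$$\mu_{x \to f}(x) = 0 \quad \text{and} \quad \mu_{f \to x}(x) = \ln f(x). \tag{15.66}$$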

Note that after one pass of the message flow, the maximum value of $P(\boldsymbol{x})$ has been obtained. However,
one is also interested in knowing the corresponding value $\boldsymbol{x}^*$ for which the maximum occurs, that is,

$$\boldsymbol{x}^* = \arg\max_{\boldsymbol{x}} P(\boldsymbol{x}).$$

This is achieved by a reverse message passing process, which is slightly different from what we have
discussed so far, and it is known as back-tracking.

Back-tracking: Assume that $x_1$ is the node chosen to play the role of the root, where the flow of
messages “converges.” From Eq. (15.61), we get

$$x_1^* = \arg\max_{x_1} \mu_{f_a \to x_1}(x_1) \mu_{f_g \to x_1}(x_1) \mu_{f_h \to x_1}(x_1). \tag{15.67}$$

A new message passing flow now starts, and the root node, $x_1$, passes the obtained optimal value to the
factor nodes to which it is connected. Let us follow this message passing flow within the nodes of the
subtree $T_A$.

• Node A: It receives $x_1^*$ from node $x_1$.

- Selection of the optimal values: Recall that

$$\mu_{f_a \to x_1}(x_1) = \max_{x_2, x_3, x_4} f_a(x_1, x_2, x_3, x_4) \mu_{x_4 \to f_a}(x_4) \mu_{x_3 \to f_a}(x_3) \mu_{x_2 \to f_a}(x_2).$$

Thus, for different values of $x_1$, different optimal values for $(x_2, x_3, x_4)$ will result. For
example, assume that in our discrete variable setting, each variable can take one out of four
possible values, that is, $x \in \{1, 2, 3, 4\}$. Then, if $x_1^* = 2$, say that the resulting optimal values
are $(x_2^*, x_3^*, x_4^*) = (1, 1, 3)$. On the other hand, if $x_1^* = 4$, then maximization may result in,
let us say, $(x_2^*, x_3^*, x_4^*) = (2, 3, 4)$. However, having obtained a specific value for $x_1^*$ via the
maximization at node $x_1$, we choose the triplet $(x_2^*, x_3^*, x_4^*)$ such that

$$(x_2^*, x_3^*, x_4^*) = \arg\max_{x_2, x_3, x_4} f_a(x_1^*, x_2, x_3, x_4) \mu_{x_4 \to f_a}(x_4) \mu_{x_3 \to f_a}(x_3) \mu_{x_2 \to f_a}(x_2). \tag{15.68}$$

Hence, during the first pass, the obtained optimal values have to be stored to be used during the
second (backward) pass.

- Message passing: Node A passes $x_4^*$ to node $x_4$, $x_2^*$ to node $x_2$, and $x_3^*$ to node $x_3$.

• Node $x_4$ passes $x_4^*$ to nodes D and E.

• Node D

- Selection of the optimal values: Select $(x_7^*, x_8^*)$ such that

$$(x_7^*, x_8^*) = \arg\max_{x_7, x_8} f_d(x_4^*, x_7, x_8) \mu_{x_7 \to f_d}(x_7) \mu_{x_8 \to f_d}(x_8).$$

- Message passing: Node D passes $(x_7^*, x_8^*)$ to nodes $x_7$ and $x_8$, respectively.

This type of flow spreads toward all the leaves, and finally

$$\boldsymbol{x}^* = \arg\max_{\boldsymbol{x}} P(\boldsymbol{x}) \tag{15.69}$$

is obtained. One may wonder why not use a similar two-stage message passing as we did with the
sum-product rule and recover $x_i^*$ for each node $i$. This would be possible if there were a guarantee for
a unique optimum, $\boldsymbol{x}^*$. If this is not the case and there are two optimal combinations that both attain
the maximum in Eq. (15.69), then we run the danger of failing to obtain either of them. To see this, let us take an
example of four variables, $x_1, x_2, x_3, x_4$, each taking values in the discrete set $\{1, 2, 3, 4\}$. Assume that
$P(\boldsymbol{x})$ does not have a unique maximum and the two combinations for optimality are

$$(x_1^*, x_2^*, x_3^*, x_4^*) = (1, 1, 2, 3) \tag{15.70}$$

and

$$(x_1^*, x_2^*, x_3^*, x_4^*) = (1, 2, 2, 4). \tag{15.71}$$

Both of them are acceptable because they correspond to $\max P(x_1, x_2, x_3, x_4)$. The back-tracking
procedure guarantees to give either of the two. In contrast, using two-stage message passing may result
in a combination of values, for example,

$$(x_1^*, x_2^*, x_3^*, x_4^*) = (1, 1, 2, 4), \tag{15.72}$$

which does not correspond to the maximum. Note that this result is correct in its own rationale. It
provides, for every node, a value for which an optimum may result. Indeed, searching for a maximum
of $P(\boldsymbol{x})$, node $x_2$ can take either the value of 1 or 2. However, what we want to find is the correct
combination for all nodes. This is guaranteed by the back-tracking procedure.

Remarks 15.3.

• The max-product (max-sum) algorithm is a generalization of the celebrated Viterbi algorithm [46],
which has extensively been used in communications [17] and speech recognition [39]. The algorithm
has been generalized to arbitrary commutative semirings on tree-structured graphs (e.g., [1,
13]).
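
To make the back-tracking idea concrete, the sketch below specializes max-product to a chain, which is the setting of the Viterbi-type recursion mentioned above. It is an illustration under assumptions: the potentials are stored as nested lists with `psis[i][a][b]` for the potential between nodes `i` and `i + 1` (0-based, states numbered from 0), the first node plays the role of the root, and the name `chain_map` is hypothetical. The maximizing arguments are stored during the first pass and then followed from the root outward, so the returned values form one consistent combination even when ties exist.

```python
def chain_map(psis, K):
    """Max-product with back-tracking on a chain.

    Returns (x_star, max_value) where max_value is the maximum of the
    unnormalized product of potentials (that is, Z times max P).
    """
    l = len(psis) + 1
    mu = [1.0] * K                               # message arriving at the last node
    best_next = [None] * (l - 1)                 # stored maximizers for back-tracking
    for i in range(l - 2, -1, -1):               # first pass: right to left
        new_mu, arg = [], []
        for a in range(K):
            scores = [psis[i][a][b] * mu[b] for b in range(K)]
            b_star = max(range(K), key=scores.__getitem__)
            new_mu.append(scores[b_star])
            arg.append(b_star)
        mu, best_next[i] = new_mu, arg
    # Root decision, then follow the stored maximizers toward the other end.
    x_star = [max(range(K), key=mu.__getitem__)]
    for i in range(l - 1):
        x_star.append(best_next[i][x_star[-1]])
    return x_star, max(mu)
```

Replacing each product by a sum of logarithms turns the same routine into its max-sum counterpart without changing the returned combination.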


FIGURE 15.28

(A) The Bayesian network of Example 15.4; (B) its moralized version, where the two parents of $\phi$ have been
connected; and (C) a possible factor graph.

Example 15.4. Consider the BN of Fig. 15.28A. The involved variables are binary, $\{0, 1\}$, and the
respective probabilities are

$P(x = 1) = 0.7$, $P(x = 0) = 0.3$,

$P(w = 1) = 0.8$, $P(w = 0) = 0.2$,

$P(y = 1 | x = 0) = 0.8$, $P(y = 0 | x = 0) = 0.2$,

$P(y = 1 | x = 1) = 0.6$, $P(y = 0 | x = 1) = 0.4$,

$P(z = 1 | y = 0) = 0.7$, $P(z = 0 | y = 0) = 0.3$,

$P(z = 1 | y = 1) = 0.9$, $P(z = 0 | y = 1) = 0.1$,

$P(\phi = 1 | x = 0, w = 0) = 0.25$, $P(\phi = 0 | x = 0, w = 0) = 0.75$,

$P(\phi = 1 | x = 1, w = 0) = 0.3$, $P(\phi = 0 | x = 1, w = 0) = 0.7$,

$P(\phi = 1 | x = 0, w = 1) = 0.2$, $P(\phi = 0 | x = 0, w = 1) = 0.8$,

$P(\phi = 1 | x = 1, w = 1) = 0.4$, $P(\phi = 0 | x = 1, w = 1) = 0.6$.

Compute the combination $x^*, y^*, z^*, \phi^*, w^*$, which results in the maximum of the joint probability

$$\begin{aligned}
P(x, y, z, \phi, w) &= P(z | y, x, \phi, w) P(y | x, \phi, w) P(\phi | x, w) P(x | w) P(w) \\
&= P(z | y) P(y | x) P(\phi | x, w) P(x) P(w),
\end{aligned}$$

which is the factorization imposed by the BN.

In order to apply the max-product rule, we first moralize the graph and then form a factor graph
version, as shown in Figs. 15.28B and C, respectively. The factor nodes realize the following potential
(factor) functions:

$$f_a(x, y) = P(y | x) P(x),$$

$$f_b(y, z) = P(z | y),$$

$$f_c(\phi, x, w) = P(\phi | x, w) P(w),$$

and obviously

$$P(x, y, z, \phi, w) = f_a(x, y) f_b(y, z) f_c(\phi, x, w).$$

Note that in this case, the normalizing constant Z = 1. Thus, the values these factor functions take, 
according to their previous definitions, are 


$f_a(x, y)$: $f_a(1, 1) = 0.42$, $f_a(1, 0) = 0.28$, $f_a(0, 1) = 0.24$, $f_a(0, 0) = 0.06$,

$f_b(y, z)$: $f_b(1, 1) = 0.9$, $f_b(1, 0) = 0.1$, $f_b(0, 1) = 0.7$, $f_b(0, 0) = 0.3$,

$f_c(\phi, x, w)$: $f_c(1, 1, 1) = 0.32$, $f_c(1, 1, 0) = 0.06$, $f_c(1, 0, 1) = 0.16$, $f_c(1, 0, 0) = 0.05$,
$f_c(0, 1, 1) = 0.48$, $f_c(0, 1, 0) = 0.14$, $f_c(0, 0, 1) = 0.64$, $f_c(0, 0, 0) = 0.15$.

Note that the number of possible values of a factor explodes by increasing the number of the involved 
variables. 

Application of the max-product algorithm: Choose as root the node $x$. Then the nodes $z$, $\phi$, and $w$
become the leaves.

• Initialization:

$$\mu_{z \to f_b}(z) = 1, \quad \mu_{\phi \to f_c}(\phi) = 1, \quad \mu_{w \to f_c}(w) = 1.$$

• Begin message passing:

- $f_b \to y$:

$$\mu_{f_b \to y}(y) = \max_{z} f_b(y, z) \mu_{z \to f_b}(z),$$

or

$$\mu_{f_b \to y}(1) = 0.9, \quad \mu_{f_b \to y}(0) = 0.7,$$

where the first one occurs at $z = 1$ and the second one also at $z = 1$.

- $y \to f_a$:

$$\mu_{y \to f_a}(y) = \mu_{f_b \to y}(y),$$

or

$$\mu_{y \to f_a}(1) = 0.9, \quad \mu_{y \to f_a}(0) = 0.7.$$

- $f_a \to x$:

$$\mu_{f_a \to x}(x) = \max_{y} f_a(x, y) \mu_{y \to f_a}(y),$$

or

$$\mu_{f_a \to x}(1) = 0.42 \cdot 0.9 = 0.378,$$

which occurs for $y = 1$. Note that for $y = 0$, the value for $\mu_{f_a \to x}(1)$ would be $0.7 \cdot 0.28 = 0.196$,
which is smaller than 0.378. Also,

$$\mu_{f_a \to x}(0) = 0.24 \cdot 0.9 = 0.216,$$

which also occurs for $y = 1$.

- $f_c \to x$:

$$\mu_{f_c \to x}(x) = \max_{w, \phi} f_c(\phi, x, w) \mu_{w \to f_c}(w) \mu_{\phi \to f_c}(\phi),$$

or

$$\mu_{f_c \to x}(1) = 0.48,$$

which occurs for $\phi = 0$ and $w = 1$, and

$$\mu_{f_c \to x}(0) = 0.64,$$

which occurs for $\phi = 0$ and $w = 1$.

- Obtain the optimal value:

$$x^* = \arg\max_{x} \mu_{f_a \to x}(x) \mu_{f_c \to x}(x),$$

or

$$x^* = 1,$$

and the corresponding maximum value is

$$\max P(x, y, z, w, \phi) = 0.378 \cdot 0.48 = 0.1814.$$


• Back-tracking: 

- Node f c : 

max/ c (l, 0 , w)/x u ,^/ c (w)/X 0 ^/ f ( 0 ), 

W,(J) 
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which has occurred for 


- Node f a : 


which has occurred for 


<p* = 0 and xv* — 1 . 


max/ a (l, y)fx y ^ fa (y), 


- Node fb'. 


which has occurred for 


>'*=!• 


max //,(1, z)lJ. z ^f b (z), 


z* = 1 . 


Thus, the optimizing combination is 


( x *, y*, z*, </>*, xv*) = ( 1 , 1 , 1 , 0 , 1 ). 
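
The result of the example can be cross-checked numerically. The following Python sketch recomputes the max-product messages toward the root $x$ and compares them with a brute-force search over all 32 configurations. It is only an illustration under stated assumptions: the tables for $P(x)$, $P(y|x)$, and $P(w)$ are not restated in this passage, so the values used below are inferred from the listed factor values, and all variable names are ours.

```python
from itertools import product

B = (0, 1)

# Tables inferred from the factor values of the example (assumption):
# f_a(1,1) = 0.42 and f_a(0,1) = 0.24 give P(x) and P(y|x);
# f_c(1,1,1) = 0.32 with P(phi=1|x=1,w=1) = 0.4 gives P(w=1) = 0.8.
P_x = {1: 0.7, 0: 0.3}
P_w = {1: 0.8, 0: 0.2}
P_y_given_x = {(1, 1): 0.6, (0, 1): 0.4, (1, 0): 0.8, (0, 0): 0.2}  # key (y, x)
P_z_given_y = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.7, (0, 0): 0.3}  # key (z, y)
P_phi1 = {(0, 0): 0.25, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}      # P(phi=1|x,w), key (x, w)


def f_a(x, y):
    return P_y_given_x[(y, x)] * P_x[x]


def f_b(y, z):
    return P_z_given_y[(z, y)]


def f_c(phi, x, w):
    p1 = P_phi1[(x, w)]
    return (p1 if phi == 1 else 1.0 - p1) * P_w[w]


def joint(x, y, z, phi, w):
    return f_a(x, y) * f_b(y, z) * f_c(phi, x, w)


# Max-product messages toward the root x (all leaf messages are equal to 1).
mu_fb_y = {y: max(f_b(y, z) for z in B) for y in B}
mu_fa_x = {x: max(f_a(x, y) * mu_fb_y[y] for y in B) for x in B}
mu_fc_x = {x: max(f_c(phi, x, w) for phi in B for w in B) for x in B}
x_star = max(B, key=lambda x: mu_fa_x[x] * mu_fc_x[x])
max_via_messages = mu_fa_x[x_star] * mu_fc_x[x_star]

# Brute-force search over all configurations (x, y, z, phi, w).
best = max(product(B, repeat=5), key=lambda c: joint(*c))

print(best, round(joint(*best), 4), round(max_via_messages, 4))
# prints: (1, 1, 1, 0, 1) 0.1814 0.1814
```

Both routes agree on the maximizing combination; the message-passing route needs only a handful of local maximizations instead of a search over all joint configurations.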


PROBLEMS 

15.1 Show that in the product

$$\prod_{i=1}^{n} (1 - x_i),$$

the number of cross-product terms, $x_1 x_2 \cdots x_k$, $1 < k \le n$, for all possible combinations of
$x_1, \ldots, x_n$ is equal to $2^n - n - 1$.

15.2 Prove that if a probability distribution p satisfies the Markov condition, as implied by a BN, 
then p is given as the product of the conditional distributions given the values of the parents. 

15.3 Show that if a probability distribution factorizes according to a BN structure, then it satisfies 
the Markov condition. 

15.4 Consider a DAG and associate each node with a random variable. Define for each node the 
conditional probability of the respective variable given the values of its parents. Show that the 
product of the conditional probabilities yields a valid joint probability and that the Markov 
condition is satisfied. 

15.5 Consider the graph in Fig. 15.29. Random variable $x$ has two possible outcomes, with probabilities
$P(x_1) = 0.3$ and $P(x_2) = 0.7$. Variable $y$ has three possible outcomes, with conditional
probabilities

$P(y_1|x_1) = 0.3$, $P(y_2|x_1) = 0.2$, $P(y_3|x_1) = 0.5$,
$P(y_1|x_2) = 0.1$, $P(y_2|x_2) = 0.4$, $P(y_3|x_2) = 0.5$.

Finally, the conditional probabilities for $z$ are

$P(z_1|y_1) = 0.2$, $P(z_2|y_1) = 0.8$,
$P(z_1|y_2) = 0.2$, $P(z_2|y_2) = 0.8$,
$P(z_1|y_3) = 0.4$, $P(z_2|y_3) = 0.6$.

Show that this probability distribution, which factorizes over the graph, renders $x$ and $z$ independent.
However, $x$ and $z$ in the graph are not $d$-separated because $y$ is not instantiated.

FIGURE 15.29

Graphical model for Problem 15.5 (a chain over the nodes $x$, $y$, $z$).

FIGURE 15.30

DAG for Problem 15.6. Nodes in red have been instantiated.

15.6 Consider the DAG in Fig. 15.30. Detect the $d$-separations and $d$-connections in the graph.

15.7 Consider the DAG of Fig. 15.31. Detect the Markov blanket of node $x_5$ and verify that if all the nodes
in the blanket are instantiated, then the node becomes $d$-separated from the rest of the nodes in
the graph.

15.8 In a linear Gaussian BN model, derive the mean values and the respective covariance matrices 
for each one of the variables in a recursive manner. 

15.9 Assuming the variables associated with the nodes of the Bayesian structure of Fig. 15.32 to be 
Gaussian, find the respective mean values and covariances. 

15.10 Prove that if $p$ is a Gibbs distribution that factorizes over an MRF $H$, then $H$ is an I-map for $p$.

15.11 Show that if $H$ is the moral graph that results from moralization of a BN structure $G$, then

$$I(H) \subseteq I(G).$$

FIGURE 15.31

The graph structure for Problem 15.7.

FIGURE 15.32

Network for Problem 15.9 (a chain over the nodes $x_1$, $x_2$, $x_3$).

15.12 Consider a BN structure $G$ and a probability distribution $p$. Show that if $I(G) \subseteq I(p)$, then $p$
factorizes over $G$.

15.13 Show that in an undirected chain graphical model, the marginal probability $P(x_j)$ of a node $x_j$
is given by

$$P(x_j) = \frac{1}{Z}\,\mu_f(x_j)\,\mu_b(x_j),$$

where $\mu_f(x_j)$ and $\mu_b(x_j)$ are the forward and backward messages received by the node.

15.14 Show that the joint distribution of two neighboring nodes in an undirected chain graphical
model is given by

$$P(x_j, x_{j+1}) = \frac{1}{Z}\,\mu_f(x_j)\,\psi_{j,j+1}(x_j, x_{j+1})\,\mu_b(x_{j+1}).$$

15.15 Using Fig. 15.26, prove that if there is a second message passing, starting from x, toward the 
leaves, then any node will have the available information for the computation of the respective 
marginals. 

15.16 Consider the tree graph of Fig. 15.26. Compute the marginal probability $P(x_1, x_2, x_3, x_4)$.

15.17 Repeat the message passing procedure to find the optimal combination of variables for Example 15.4
using the logarithmic version and the max-sum algorithm.

16.1 INTRODUCTION

This is the follow-up to Chapter 15 and it builds upon the notions and models introduced there. The 
emphasis of this chapter is on more advanced topics for probabilistic graphical models. It wraps up the 
topic of exact inference in the context of junction trees and then moves on to introduce approximate 
inference techniques. This establishes a bridge with Chapter 13. Then, dynamic Bayesian networks are 
introduced, with an emphasis on hidden Markov models (HMMs). Inference and training of HMMs
are seen as special cases of the message passing algorithm that was introduced in Chapter 15 and the
EM scheme discussed in Chapter 12. Finally, the more general concept of training graphical models is
briefly discussed.

16.2 TRIANGULATED GRAPHS AND JUNCTION TREES

In Chapter 15, we discussed three efficient schemes for exact inference in graphical entities of a tree 
structure. Our focus in this section is on presenting a methodology that can transform an arbitrary graph 
into an equivalent one having a tree structure. Thus, in principle, such a procedure offers the means 
for exact inference in arbitrary graphs. This transformation of an arbitrary graph to a tree involves a 
number of stages. Our goal is to present these stages and explain the procedure more via examples and 
less via formal mathematical proofs. A more detailed treatment can be obtained from more specialized 
sources (for example, [32,45]). 

We assume that our graph is undirected. Thus, if the original graph was a directed one, then it is 
assumed that the moralization step has previously been applied. 

Definition 16.1. An undirected graph is said to be triangulated if and only if for every cycle of length 
greater than three the graph possesses a chord. A chord is an edge joining two nonconsecutive nodes 
in the cycle. 

In other words, in a triangulated graph, the largest “minimal cycle” is a triangle. Fig. 16.1A shows
a graph with a cycle of length $n = 4$ and Figs. 16.1B and C show two triangulated versions; note that
the process of triangulation does not lead to unique answers. Fig. 16.2A is an example of a graph with
a cycle of $n = 5$ nodes. Fig. 16.2B, although it has an extra edge joining two nonconsecutive nodes, is
not triangulated. This is because there still remains a chordless cycle of four nodes ($x_2 - x_3 - x_4 - x_5$).
Fig. 16.2C is a triangulated version. There are no cycles of length $n > 3$ without a chord. Note that by
joining nonconsecutive nodes in order to triangulate a graph, we divide it into cliques (Section 15.4);
we will appreciate this very soon. Figs. 16.1B and C comprise two (three-node) cliques and Fig. 16.2C
comprises three cliques. This is not the case with Fig. 16.2B, where the subgraph $(x_2, x_3, x_4, x_5)$ is not
a clique.

Let us now see how the previous definition relates to the task of variable elimination, which under- 
lies the message passing philosophy. In our discussion on such algorithmic schemes, we started with a 
node and marginalized out the respective variable (e.g., in the sum-product algorithm); as a matter of 
fact, this is not quite true. The message passing was initialized at the leaves of the tree graphs; this was 
done on purpose, although not explicitly stated there. We will soon realize why. 

Consider Fig. 16.3 and let

$$P(\boldsymbol{x}) := \psi_1(x_1)\psi_2(x_1, x_2)\psi_3(x_1, x_3)\psi_4(x_2, x_4)\psi_5(x_2, x_3, x_5)\psi_6(x_3, x_6), \tag{16.1}$$

assuming that $Z = 1$.

FIGURE 16.1

(A) A graph with a cycle of length $n = 4$. (B, C) Two possible triangulated versions.

FIGURE 16.2

(A) A graph with a cycle of length $n = 5$. (B) Adding one edge still leaves a cycle of length $n = 4$ chordless. (C) A triangulated
version; there are no cycles of length $n > 3$ without a chord.

FIGURE 16.3

(A) An undirected graph with potential (factor) functions $\psi_1(x_1)$, $\psi_2(x_1, x_2)$, $\psi_3(x_1, x_3)$, $\psi_4(x_2, x_4)$,
$\psi_5(x_2, x_3, x_5)$, and $\psi_6(x_3, x_6)$. (B) The graph resulting after the elimination of $x_6$. (C) The graph that would have
resulted if the first node to be eliminated were $x_3$. Observe the fill-in edges denoted by red. (D-F) The graphs that
would result if the elimination process had continued from the topology shown in (B) by sequentially removing the
nodes $x_5$, $x_3$, and finally $x_1$.

Let us eliminate $x_6$ first, that is,

$$\sum_{x_6} P(\boldsymbol{x}) = \psi^{(1)}(x_1, x_2, x_3, x_4, x_5)\sum_{x_6}\psi_6(x_3, x_6) = \psi^{(1)}(x_1, x_2, x_3, x_4, x_5)\,\psi^{(3)}(x_3), \tag{16.2}$$

where the definitions of $\psi^{(1)}$ and $\psi^{(3)}$ are self-explained, by comparing Eqs. (16.1) and (16.2). The
result of elimination is equivalent to a new graph, shown in Fig. 16.3B, with $P(\boldsymbol{x})$ given as the product
of the same potential functions as before with the exception of $\psi_3$, which is now replaced by the product
$\psi_3(x_1, x_3)\psi^{(3)}(x_3)$. Basically, $\psi^{(3)}(\cdot)$ is the message passed to $x_3$.

In contrast, let us now start by eliminating $x_3$ first. Then we have

$$\sum_{x_3} P(\boldsymbol{x}) = \psi^{(2)}(x_1, x_2, x_4)\sum_{x_3}\psi_3(x_1, x_3)\psi_5(x_2, x_3, x_5)\psi_6(x_3, x_6) = \psi^{(2)}(x_1, x_2, x_4)\,\psi^{(3)}(x_1, x_2, x_5, x_6).$$

Note that this summation is more difficult to perform. It involves four variables $(x_1, x_2, x_5, x_6)$ besides
$x_3$, which requires many more combination terms to be computed than before. Fig. 16.3C shows
the resulting equivalent graph, after eliminating $x_3$. Due to the resulting factor $\psi^{(3)}(x_1, x_2, x_5, x_6)$, new
connections implicitly appear, known as fill-in edges. This is not a desired situation, as it introduces
factors depending on new combinations of variables. Moreover, the new factor depends on four variables,
and we know that the larger the number of variables, or the domain of the factor as we say, the
larger the number of terms involved in the summations, which increases the computational load.

Thus, the choice of the sequence of elimination is very important and far from innocent. For example,
for the case of Fig. 16.3A, an elimination sequence that does not introduce fill-ins is the following:
$x_6$, $x_5$, $x_3$, $x_1$, $x_2$, $x_4$. For such an elimination sequence, every time a variable is eliminated, the new
graph results from the previous one by just removing one node. This is shown by the sequence of
graphs in Figs. 16.3A, B, and D-F, for the case of the previously given elimination sequence. An
elimination sequence that does not introduce fill-ins is known as a perfect elimination sequence.
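
The effect of the elimination order can be illustrated with a short Python sketch. It reads the edges of Fig. 16.3A off the domains of the potential functions in Eq. (16.1), eliminates the nodes in a given order, and records every fill-in edge that appears. The function name `fill_in_edges` and the integer node labels are our own choices for the illustration.

```python
from itertools import combinations

# Edges of the undirected graph of Fig. 16.3A, read off the domains of the
# potential functions in Eq. (16.1); node k stands for x_k.
EDGES = [(1, 2), (1, 3), (2, 4), (2, 3), (2, 5), (3, 5), (3, 6)]


def fill_in_edges(edges, order):
    """Eliminate the nodes in the given order and return the fill-in edges created."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    fill_ins = []
    for node in order:
        neighbors = nbrs.pop(node)
        # Summing out `node` couples all its remaining neighbors in one factor,
        # so every pair of neighbors that is not yet linked gets a fill-in edge.
        for u, v in combinations(sorted(neighbors), 2):
            if v not in nbrs[u]:
                nbrs[u].add(v)
                nbrs[v].add(u)
                fill_ins.append((u, v))
        for u in neighbors:
            nbrs[u].discard(node)
    return fill_ins


print(fill_in_edges(EDGES, [6, 5, 3, 1, 2, 4]))  # prints: []
print(fill_in_edges(EDGES, [3, 6, 5, 1, 2, 4]))  # prints: [(1, 5), (1, 6), (2, 6), (5, 6)]
```

The first order is the sequence given in the text and creates no fill-ins. Eliminating $x_3$ first links all pairs among $x_1$, $x_2$, $x_5$, $x_6$ that were not already connected, in line with the four-variable factor discussed above.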

Proposition 16.1. An undirected graph is triangulated if and only if it has a perfect elimination sequence
(for example, [32]).

Definition 16.2. A tree $T$ is said to be a join tree if (a) its nodes correspond to the cliques of an
(undirected) graph $G$ and (b) the intersection of any two nodes, $U \cap V$, is contained in every node
in the unique path between $U$ and $V$. The latter property is also known as the running intersection
property.

Moreover, if a probability distribution $p$ factorizes over $G$ so that each of the product factors (potential
functions) is attached to a clique (i.e., depends only on variables associated with the nodes in
the clique), then the join tree is said to be a junction tree for $p$ [7].

Example 16.1. Consider the triangulated graph of Fig. 16.2C. It comprises three cliques, namely,
$(x_1, x_2, x_5)$, $(x_2, x_3, x_4)$, and $(x_2, x_4, x_5)$. Associating each clique with a node of a tree, Fig. 16.4
presents three possibilities. The trees in Fig. 16.4A and B are not join trees. Indeed, the intersection
$\{x_1, x_2, x_5\} \cap \{x_2, x_4, x_5\} = \{x_2, x_5\}$ does not appear in node $(x_2, x_3, x_4)$. Similar arguments hold true
for the case of Fig. 16.4B. In contrast, the tree in Fig. 16.4C is a join tree, because the intersection
$\{x_1, x_2, x_5\} \cap \{x_2, x_3, x_4\} = \{x_2\}$ is contained in $(x_2, x_4, x_5)$. If, now, we have a distribution such as

$$p(\boldsymbol{x}) = \psi_1(x_1, x_2, x_5)\psi_2(x_2, x_3, x_4)\psi_3(x_2, x_4, x_5),$$

the graph in Fig. 16.4C is a junction tree for $p(\boldsymbol{x})$.

FIGURE 16.4

The graphs resulting from Fig. 16.2C and shown in (A) and (B) are not join trees. (C) This is a join tree, because the
node in the path from $(x_1, x_2, x_5)$ to $(x_2, x_3, x_4)$ contains their intersection, $x_2$.
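
The running intersection property of Definition 16.2 is easy to test mechanically. The sketch below does so for the three cliques of Fig. 16.2C. A tree over three nodes is necessarily a chain, so the three candidate trees differ only in which clique sits in the middle; which chain corresponds to which panel of Fig. 16.4 is our assumption, as are the function and variable names.

```python
def is_join_tree(cliques, edges):
    """Check the running intersection property for a tree over cliques.

    cliques: list of sets of variable indices; edges: index pairs forming a tree.
    """
    adj = {i: [] for i in range(len(cliques))}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def path(src, dst, seen=()):
        # Unique path between two nodes of a tree, found by depth-first search.
        if src == dst:
            return [src]
        for nxt in adj[src]:
            if nxt not in seen:
                rest = path(nxt, dst, seen + (src,))
                if rest:
                    return [src] + rest
        return None

    for u in range(len(cliques)):
        for v in range(u + 1, len(cliques)):
            common = cliques[u] & cliques[v]
            # The intersection must be contained in every node on the path.
            if any(not common <= cliques[k] for k in path(u, v)):
                return False
    return True


# Cliques of Fig. 16.2C; index k stands for x_k.
CLIQUES = [{1, 2, 5}, {2, 3, 4}, {2, 4, 5}]

middle_234 = [(0, 1), (1, 2)]  # (x2,x3,x4) in the middle (assumed to be Fig. 16.4A)
middle_125 = [(1, 0), (0, 2)]  # (x1,x2,x5) in the middle (assumed to be Fig. 16.4B)
middle_245 = [(0, 2), (2, 1)]  # (x2,x4,x5) in the middle (Fig. 16.4C)

print(is_join_tree(CLIQUES, middle_234),
      is_join_tree(CLIQUES, middle_125),
      is_join_tree(CLIQUES, middle_245))
# prints: False False True
```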


We are now ready to state the basic theorem of this section, the one that will allow us to transform
an arbitrary graph into a graph of a tree structure.

Theorem 16.1. An undirected graph is triangulated if and only if its cliques can be organized into a
join tree (Problem 16.1).

Once a triangulated graph, which is associated with a factorized probability distribution p(x), 
has been transformed into a junction tree, then any of the message passing algorithms, described in 
Chapter 15, can be adopted to perform exact inference. 

16.2.1 CONSTRUCTING A JOIN TREE 

Starting from a triangulated graph, the following algorithmic steps construct a join tree ([32]):

• Select a node in a maximal clique of the triangulated graph, which is not shared by other cliques.
Eliminate this node and keep removing nodes from the clique, as long as they are not shared by
other cliques. Denote the set of the remaining nodes of this clique as $S_i$, where $i$ is the number of
the nodes eliminated so far. This set is called a separator. Use $V_i$ to denote the set of all the nodes
in the clique, prior to the elimination process.

• Select another maximal clique and repeat the process with the index counting the node elimination
starting from $i$.

• Continue the process until all cliques have been eliminated. Once the previous peeling off procedure
has been completed, join together the parts that have resulted, so that each separator, $S_i$, is joined to
$V_i$ on one of its sides and to a clique node (set) $V_j$ ($j > i$), such that $S_i \subseteq V_j$. This is in line with
the running intersection property. It can be shown that the resulting graph is a join tree (part of the
proof in Problem 16.1).

An alternative algorithmic path to construct a join tree, once the cliques have been formed, is the
following. Build an undirected graph having as nodes the maximal cliques of the triangulated graph.
For each pair of linked nodes, $V_i$, $V_j$, assign a weight $w_{ij}$ on the respective edge equal to the cardinality
of $V_i \cap V_j$. Then run the maximal spanning tree algorithm (e.g., [43]) to identify a tree in this graph
such that the sum of weights is maximal [41]. It turns out that such a procedure guarantees the running
intersection property.

Example 16.2. Consider the graph of Fig. 16.5, which is described in the seminal paper [44]. Smoking
can cause lung cancer or bronchitis. A recent visit to Asia increases the probability of tuberculosis.
Both tuberculosis and cancer can result in a positive X-ray finding. Also, all three diseases can cause
difficulty in breathing (dyspnea). In the context of the current example, we are not interested in the
values of the respective probability table, and our goal is to construct a join tree, following the previous
algorithm. Fig. 16.6 shows a triangulated graph that corresponds to Fig. 16.5.

The elimination sequence of the nodes in the triangulated graph is graphically illustrated in
Fig. 16.7. First, node $A$ is eliminated from the clique $(A, T)$ and the respective separator set comprises
$T$. Because only one node can be eliminated ($i = 1$), we indicate the separator as $S_1$. Next, node
$T$ is eliminated from the clique $(T, L, E)$ and the $S_2$ ($2 = 1 + 1$) separator comprises $L$, $E$. The process
continues until clique $(B, D, E)$ is the only remaining one. It is denoted as $V_8$, as all three nodes can
be eliminated sequentially (hence, $8 = 5 + 3$); there is no other neighboring clique. Fig. 16.8 shows the
resulting junction tree. Verify the running intersection property.

FIGURE 16.5

The Bayesian network structure of the example given in [44].

FIGURE 16.6

The graph resulting from the Bayesian network structure of Fig. 16.5, after having been moralized and triangulated.
The inserted edges are drawn in red.

FIGURE 16.7

(A-D) The sequence of elimination of nodes from the respective cliques of Fig. 16.6 and the resulting separators.

FIGURE 16.8

The resulting join tree from the graph of Fig. 16.6. A separator $S_i$ is linked to a clique $V_j$ ($j > i$) so that $S_i \subseteq V_j$.
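
The alternative construction of Section 16.2.1, based on a maximal spanning tree, can be sketched in a few lines of Python. The sketch below is an illustration, not the procedure of [41] or [43] verbatim: it uses Kruskal's algorithm with a simple union-find, applied to the six cliques of the triangulated graph of this example, $(A, T)$, $(T, L, E)$, $(S, L, B)$, $(B, L, E)$, $(X, E)$, and $(B, D, E)$. The function names are ours. Because several edges have equal weight, the maximal spanning tree is not unique, and the tree found here need not coincide with Fig. 16.8.

```python
from itertools import combinations

# Maximal cliques of the triangulated graph of Fig. 16.6 (one letter per node).
CLIQUES = [frozenset(c) for c in ("AT", "TLE", "SLB", "BLE", "XE", "BDE")]


def max_spanning_join_tree(cliques):
    """Kruskal's algorithm on the clique graph, with weights |V_i & V_j|."""
    candidates = sorted(
        ((len(cliques[i] & cliques[j]), i, j)
         for i, j in combinations(range(len(cliques)), 2)
         if cliques[i] & cliques[j]),
        reverse=True,
    )
    parent = list(range(len(cliques)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for weight, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so no cycle is created
            parent[ri] = rj
            tree.append((i, j, weight))
    return tree


def has_running_intersection(cliques, tree):
    """Equivalent test: the cliques containing a variable form a connected subtree."""
    for var in set().union(*cliques):
        holders = {i for i, c in enumerate(cliques) if var in c}
        links = [(i, j) for i, j, _ in tree if i in holders and j in holders]
        # A subforest on n nodes is connected if and only if it has n - 1 edges.
        if len(links) != len(holders) - 1:
            return False
    return True


tree = max_spanning_join_tree(CLIQUES)
for i, j, w in tree:
    print("".join(sorted(CLIQUES[i])), "-", "".join(sorted(CLIQUES[j])), "weight", w)
print(has_running_intersection(CLIQUES, tree))
```

The separators of the resulting join tree are the intersections of the linked cliques, so the edge weights are the separator sizes.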

16.2.2 MESSAGE PASSING IN JUNCTION TREES 

By its definition, a junction tree is a join tree where we have associated a factor, say, $\psi_c$, of a probability
distribution, $p$, with each one of the cliques. Each factor can be considered as the product of all potential
functions, which are defined in terms of the variables associated with the nodes of the corresponding
clique; hence, the domain of each one of these potential functions is a subset of the variables-nodes
comprising the clique. Then, focusing on the discrete probability case, we can write

$$P(\boldsymbol{x}) = \frac{1}{Z}\prod_c \psi_c(\boldsymbol{x}_c), \tag{16.3}$$

where $c$ runs over the cliques and $\boldsymbol{x}_c$ denotes the variables comprising the respective clique. Because
a junction tree is a graph with a tree structure, exact inference can take place in the same way as we
have already discussed in Section 15.7, via a message passing rationale. A two-way message passing
is also required here. There are, however, some small differences. In the case of the factor graphs,
which we have considered previously in Chapter 15, the exchanged messages were functions of one
variable. This is not necessarily the case here. Moreover, after the bidirectional flow of the messages
has been completed, what is recovered from each node of the junction tree is the joint probability
of the variables associated with the clique, $P(\boldsymbol{x}_c)$. The computation of the marginal probabilities for
individual variables requires extra summations in order to marginalize with respect to the rest.

Note that in the message passing, the following take place:

• A separator receives messages and passes their product to one of its connected cliques, depending
on the direction of the message passing flow, that is,

$$\mu_{S \to V}(\boldsymbol{x}_S) = \prod_{v \in \mathcal{N}(S) \setminus V} \mu_{v \to S}(\boldsymbol{x}_S), \tag{16.4}$$

where $\mathcal{N}(S)$ is the index set of the clique nodes connected to $S$ and $\mathcal{N}(S) \setminus V$ is this set excluding
the index for clique node $V$. Note that the message is a function of the variables comprising the
separator.

• Each clique node performs marginalization and passes the message to each one of its connected
separators, depending on the direction of the flow. Let $V$ be a clique node and $\boldsymbol{x}_V$ the vector of the
involved variables in it, and let $S$ be a separator node connected to it. The message passed to $S$ is
given by

$$\mu_{V \to S}(\boldsymbol{x}_S) = \sum_{\boldsymbol{x}_V \setminus \boldsymbol{x}_S} \psi_V(\boldsymbol{x}_V) \prod_{s \in \mathcal{N}(V) \setminus S} \mu_{s \to V}(\boldsymbol{x}_s). \tag{16.5}$$

By $\boldsymbol{x}_S$ we denote the variables in the separator $S$. Obviously, $\boldsymbol{x}_S \subseteq \boldsymbol{x}_V$ and $\boldsymbol{x}_V \setminus \boldsymbol{x}_S$ denotes all variables
in $\boldsymbol{x}_V$ excluding those in $\boldsymbol{x}_S$; $\mathcal{N}(V)$ is the index set of all separators connected to $V$ and $\boldsymbol{x}_s$ is the
set of variables in the respective separator ($\boldsymbol{x}_s \subseteq \boldsymbol{x}_V$, $s \in \mathcal{N}(V)$); $\mathcal{N}(V) \setminus S$ denotes the index set of
all separators connected to $V$ excluding $S$. This is basically the counterpart of Eq. (15.50). Fig. 16.9
shows the respective configuration.

FIGURE 16.9

Clique node $V$ “collects” all incoming messages from the separators it is connected with (except $S$); then it outputs
a message to $S$, after the marginalization performed on the product of $\psi_V(\boldsymbol{x}_V)$ with the incoming messages.

Once the two-way message passing has been completed, marginals in the clique as well as the
separator nodes are computed as (Problem 16.3):

• Clique nodes:

$$P(\boldsymbol{x}_V) = \frac{1}{Z}\,\psi_V(\boldsymbol{x}_V) \prod_{s \in \mathcal{N}(V)} \mu_{s \to V}(\boldsymbol{x}_s). \tag{16.6}$$

• Separator nodes: Each separator is connected only to clique nodes. After the two-way message
passing, every separator has received messages from both flow directions. Then it is shown that

$$P(\boldsymbol{x}_S) = \frac{1}{Z} \prod_{v \in \mathcal{N}(S)} \mu_{v \to S}(\boldsymbol{x}_S). \tag{16.7}$$

An important by-product of the previous message passing algorithm in junction trees concerns the
joint distribution of all the involved variables, which turns out to be independent of $Z$, and it is given
by (Problem 16.4)

$$P(\boldsymbol{x}) = \frac{\prod_v P_v(\boldsymbol{x}_v)}{\prod_s \left[P_s(\boldsymbol{x}_s)\right]^{d_s - 1}}, \tag{16.8}$$

where $\prod_v$ and $\prod_s$ run over the sets of clique nodes and separators, respectively, and $d_s$ is the number
of the cliques separator $S$ is connected to.

Example 16.3. Let us consider the junction tree of Fig. 16.8. Assume that $\psi_1(A, T)$, $\psi_2(T, L, E)$,
$\psi_3(S, L, B)$, $\psi_4(B, L, E)$, $\psi_5(X, E)$, and $\psi_6(B, D, E)$ are known. For example,

$$\psi_1(A, T) = P(T|A)P(A)$$

and

$$\psi_3(S, L, B) = P(L|S)P(B|S)P(S).$$

The message passing can start from the leaves, $(A, T)$ and $(X, E)$, toward $(S, L, B)$; once this message
flow has been completed, message passing takes place in the reverse direction. Some examples of
message computations are given below.

The message received by node $(T, L, E)$ is equal to

$$\mu_{S_1 \to V_2}(T) = \sum_A \psi_1(A, T).$$

Also,

$$\mu_{V_2 \to S_2}(L, E) = \sum_T \psi_2(T, L, E)\,\mu_{S_1 \to V_2}(T) = \mu_{S_2 \to V_4}(L, E),$$

and

$$\mu_{V_4 \to S_3}(L, B) = \sum_E \psi_4(B, L, E)\,\mu_{S_2 \to V_4}(L, E)\,\mu_{S_4 \to V_4}(B, E).$$

The rest of the messages are computed in a similar way.

For the marginal probability $P(T, L, E)$ of the variables in clique node $V_2$, we get

$$P(T, L, E) = \psi_2(T, L, E)\,\mu_{S_2 \to V_2}(L, E)\,\mu_{S_1 \to V_2}(T).$$

Observe that in this product, all other variables, besides $T$, $L$, and $E$, have been marginalized out. Also,

$$P(L, E) = \mu_{V_4 \to S_2}(L, E)\,\mu_{V_2 \to S_2}(L, E).$$

Remarks 16.1. 

• Note that a variable is part of more than one node in the tree. Hence, if one is interested in obtaining
the marginal probability of an individual variable, this can be obtained by marginalizing over
different variables in different nodes. The properties of the junction tree guarantee that all of them
give the same result (Problem 16.5).

• We have already commented that there is not a unique way to triangulate a graph. A natural question
is now raised: Are all the triangulated versions equivalent from a computational point of view?
Unfortunately, the answer is no. Let us consider the simple case where all the variables have the
same number of possible states, $k$. Then the number of probability values for each clique node
depends on the number of variables involved in it, and we know that this dependence is of an
exponential form. Thus, our goal while triangulating a graph should be to implement it in such a
way that the resulting cliques are as small as possible with respect to the number of nodes-variables
involved. Let us define the size of a clique, $V_i$, as $s_i = k^{n_i}$, where $n_i$ denotes the number of nodes
comprising the clique. Ideally, we should aim at obtaining a triangulated version (or equivalently
an elimination sequence) so that the total size of the triangulated graph, $\sum_i s_i$, where $i$ runs over
all cliques, is minimum. Unfortunately, this is an NP-hard task [1]. One of the earliest algorithms
proposed to obtain low-size triangulated graphs is given in [71]. A survey of related algorithms is
provided in [39].
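
To get a feeling for how strongly the total size $\sum_i s_i$ depends on the triangulation, the following Python sketch compares two triangulated versions of the five-node cycle of Fig. 16.2A: the one of Fig. 16.2C, with three three-node cliques, and a hypothetical one, introduced here only for comparison, in which all chords are added so that the whole graph is a single five-node clique.

```python
def total_size(cliques, k):
    """Total size sum_i k**n_i when every variable has k possible states."""
    return sum(k ** len(c) for c in cliques)


fig_16_2c = [("x1", "x2", "x5"), ("x2", "x3", "x4"), ("x2", "x4", "x5")]
fully_connected = [("x1", "x2", "x3", "x4", "x5")]  # hypothetical: all chords added

for k in (2, 10):
    print(k, total_size(fig_16_2c, k), total_size(fully_connected, k))
# prints:
# 2 24 32
# 10 3000 100000
```

The gap widens quickly with the number of states per variable, which is why low-size triangulations matter in practice.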


16.3 APPROXIMATE INFERENCE METHODS 

So far, our focus has been on presenting efficient algorithms for exact inference in graphical models. 
Although such schemes form the basis of inference and have been applied in a number of applications, 
often one encounters tasks where exact inference is not practically possible. At the end of the previous
section, we discussed the importance of small-sized cliques. However, in a number of cases, the
graphical model may be so densely connected that it renders the task of obtaining cliques of a small
size impossible. We will soon consider some examples.

In such cases, resorting to methods for tractable approximate inference is the only viable alternative. 
Obviously, there are various paths to approach this problem and a number of techniques have been 
proposed. Our goal in this section is to discuss the main directions that are currently popular. Our 
approach will be more on the descriptive side than that of rigorous mathematical proofs and theorems. 
The reader who is interested in delving deeper into this topic can refer to more specialized references, 
which are given in the text below.

16.3.1 VARIATIONAL METHODS: LOCAL APPROXIMATION


The current and the next subsections draw a lot of their theoretical basis on the variational approxima- 
tion methods, which were introduced in Chapter 13 and in particular in Sections 13.2 and 13.8. 

The main goal in variational approximation methods is to replace probability distributions with 
computationally attractive bounds. The effect of such deterministic approximation methods is that 
it simplifies the computations; as we will soon see, this is equivalent to simplifying the graphical 
structure. Yet, these simplifications are carried out in the context of an associated optimization process. 
The functional form of these bounds is very much problem dependent, so we will demonstrate the 
methodology via some selected examples. 

Two main directions are followed: the sequential one and the block one [34]. The former will be
treated in this subsection and the latter in the next one.

In the sequential methods, the approximation is imposed on individual nodes in order to modify
the functional form of the local probability distribution functions. This is the reason we called them
local methods. One can impose the approximation on some of the nodes or on all of them. Usually,
some of the nodes are selected, whose number is sufficient so that exact inference can take place with
the remaining ones, within practically acceptable computational time and memory size. An alternative
viewpoint is to look at the method as a sparsification procedure that removes nodes so as to transform
the original graph to a “computationally” manageable one. There are different scenarios on how to
select nodes. One way is to introduce approximation to one node at a time until a sufficiently simplified
structure occurs. The other way is to introduce approximation to all the nodes and then reinstate the
exact distributions one node at a time. The latter of the two has the advantage that the network is
computationally tractable all the way (see, for example, [30]). Local approximations are inspired by
the method of bounding convex/concave functions in terms of their conjugate ones, as discussed in
Section 13.8. Let us now unveil the secrets behind the method.

Multiple-Cause Networks and the Noisy-OR Model 

In the beginning of Chapter 15 (Section 15.2) we presented a simplified case from the medical diagnosis
field, concerning a set of diseases and findings. Adopting the so-called noisy-OR model, we arrived at
Eqs. (15.4) and (15.5), which are repeated here for convenience. We have

$$P(f_i = 0|\boldsymbol{d}) = \exp\Big(-\sum_{j \in \mathrm{Pa}_i} \theta_{ij} d_j\Big), \tag{16.9}$$

$$P(f_i = 1|\boldsymbol{d}) = 1 - \exp\Big(-\sum_{j \in \mathrm{Pa}_i} \theta_{ij} d_j\Big), \tag{16.10}$$

where we have exploited the experience we have gained so far and we have introduced in the notation
the set of the parents $\mathrm{Pa}_i$ of the $i$th finding. The respective graphical model belongs to the family of
multiple-cause networks (Section 15.3.6) and it is shown in Fig. 16.10A. We will now pick a specific
node, say the $i$th node, assume that it corresponds to a positive finding ($f_i = 1$), and demonstrate
how the variational approximation method can offer a way out from the “curse” of the exponential
dependence of the joint probability on the number of the involved terms; recall that this is caused by
the form of Eq. (16.10).

FIGURE 16.10

(A) A Bayesian network for a set of findings and diseases. To simplify the figure only a few connections are shown.
The node associated with the $i$th finding, together with its parents and respective edges, are shown in red; it is the
node on which variational approximation is introduced. (B) After the variational approximation is performed for
node $i$, the edges joining it with its parents are removed. At the same time, the prior probabilities of the respective
parent nodes change values. This is shown in the figure for the $j$th disease. (C) The graph that would have resulted
after the moralization step, focusing on node $i$.


Derivation of the variational bound: The function $1 - \exp(-x)$ belongs to the so-called log-concave
family of functions, meaning that

$$f(x) = \ln(1 - \exp(-x)), \quad x > 0,$$

is concave (Problem 16.9). Being a concave function, we know from Section 13.8 that it is upper
bounded by

$$f(x) \le \xi x - f^*(\xi),$$

where $f^*(\xi)$ is its conjugate function. Tailoring it to the needs of Eq. (16.10) and using $\xi_i$ in place of $\xi$,
to explicitly indicate the dependence on node $i$, we obtain

$$P(f_i = 1|\boldsymbol{d}) \le \exp\Big(\xi_i \sum_{j \in \mathrm{Pa}_i} \theta_{ij} d_j - f^*(\xi_i)\Big), \tag{16.11}$$

or

$$P(f_i = 1|\boldsymbol{d}) \le \exp\big(-f^*(\xi_i)\big) \prod_{j \in \mathrm{Pa}_i} \big(\exp(\xi_i \theta_{ij})\big)^{d_j}, \tag{16.12}$$

where (Problem 16.10)

$$f^*(\xi_i) = -\xi_i \ln(\xi_i) + (\xi_i + 1)\ln(\xi_i + 1), \quad \xi_i > 0.$$

Note that usually, a constant $\theta_{i0}$ is also present in the linear terms ($\sum_{j \in \mathrm{Pa}_i} \theta_{ij} d_j + \theta_{i0}$) and in this case
the first exponent in the upper bound becomes $\exp(-f^*(\xi_i) + \xi_i \theta_{i0})$.
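
The bound is easy to probe numerically. The Python sketch below evaluates $1 - \exp(-x)$ and the variational upper bound $\exp(\xi x - f^*(\xi))$ on a grid, for a fixed value of the variational parameter chosen by us ($\xi = 0.5$). The point of tangency, $x = \ln((1 + \xi)/\xi)$, is not quoted from the text; it follows from setting the derivative of $\xi x - f(x)$ with respect to $x$ to zero, and it is stated here as our own derivation.

```python
import math


def f_conj(xi):
    """Conjugate function f*(xi) = -xi ln(xi) + (xi + 1) ln(xi + 1), xi > 0."""
    return -xi * math.log(xi) + (xi + 1.0) * math.log(xi + 1.0)


def upper_bound(x, xi):
    """Variational upper bound exp(xi * x - f*(xi)) on 1 - exp(-x)."""
    return math.exp(xi * x - f_conj(xi))


xi = 0.5  # variational parameter, arbitrary choice for the illustration
grid = [0.05 * n for n in range(1, 200)]

# Smallest gap between the bound and the function over the grid (never negative).
gap = min(upper_bound(x, xi) - (1.0 - math.exp(-x)) for x in grid)

# Point where the bound touches the function (our derivation, see lead-in).
x_tight = math.log((1.0 + xi) / xi)

print(gap >= 0.0)
print(round(upper_bound(x_tight, xi), 6), round(1.0 - math.exp(-x_tight), 6))
```

Each value of $\xi$ gives a bound that is tight at a single point, which is why the variational parameters are subsequently optimized rather than fixed in advance.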

Let us now observe Eq. (16.12). The first factor on the right-hand side is a constant, once $\xi_i$ is determined.
Moreover, each one of the factors, $\exp(\xi_i \theta_{ij})$, is also a constant raised to the power $d_j$. Thus, substituting
Eq. (16.12) in the products in Eq. (15.1), in order to compute, for example, Eq. (15.3), each one of
these constants can be absorbed by the respective $P(d_j)$, that is,

$$\tilde{P}(d_j) \propto P(d_j)\exp(\xi_i \theta_{ij} d_j), \quad j \in \mathrm{Pa}_i.$$

Basically, from a graphical point of view, we can equivalently consider that the $i$th node is delinked
and its influence on any subsequent processing is via the modified factors associated with its parent
nodes (see Fig. 16.10B). In other words, the variational approximation decouples the parent nodes. In
contrast, for exact inference, during the moralization stage, all parents of node $i$ are connected. This is
the source of computational explosion (see Fig. 16.10C). The idea is to remove a sufficient number of
nodes, so that the remaining network can be handled using exact inference methods.

There is still a main point to be addressed: how the various $\xi_i$'s are obtained. These are computed
to make the bound as tight as possible, and any standard optimization technique can be used. Note that
this minimization corresponds to a convex cost function (Problem 16.11). Besides the upper bound,
a lower bound can also be derived [27]. Experiments performed in [30] verify that reasonably good
accuracies can be obtained in affordable computational times. The method was first proposed in [27].

The Boltzmann Machine 

The Boltzmann machine, which was introduced in Section 15.4.2, is another example where any
attempt for exact inference is confronted with cliques of sizes that make the task computationally
intractable [28].

We will demonstrate the use of the variational approximation in the context of the computation of
the normalizing constant $Z$. Recall from Eq. (15.23) that

$$Z = \sum_{\boldsymbol{x}} \exp\Big(-\sum_i \Big(\sum_{j > i} \theta_{ij} x_i x_j + \theta_{i0} x_i\Big)\Big) = \sum_{\boldsymbol{x} \setminus x_k} \sum_{x_k = 0}^{1} \exp\Big(-\sum_i \Big(\sum_{j > i} \theta_{ij} x_i x_j + \theta_{i0} x_i\Big)\Big), \tag{16.13}$$

where we chose node $x_k$ to impose variational approximation. We split the summation into two, one
with regard to $x_k$ and one with regard to the rest of the variables; $\boldsymbol{x} \setminus x_k$ denotes summation over all
variables excluding $x_k$. Performing the inner sum in Eq. (16.13), that is, separating the terms that do not
involve $x_k$ and evaluating the remaining ones at $x_k = 0$ and $x_k = 1$, we get

$$Z = \sum_{\boldsymbol{x} \setminus x_k} \exp\Big(-\sum_{i \ne k} \Big(\sum_{i < j \ne k} \theta_{ij} x_i x_j + \theta_{i0} x_i\Big)\Big) \Big(1 + \exp\Big(-\sum_{i \ne k} \theta_{ki} x_i - \theta_{k0}\Big)\Big),$$

where $i < j \ne k$ indicates that both $i$ and $j$ are different from $k$.

Derivation of the variational bound: The function $1 + \exp(-x)$, $x \in \mathbb{R}$, is log-convex (Problem 16.12);
thus, following similar arguments to those adopted for the derivation of the bound in
Eq. (16.12), but for convex instead of concave functions, we obtain

$$
Z \ge \sum_{\boldsymbol{x} \setminus x_k} \exp\Big(-\sum_{i \ne k} \Big(\sum_{i<j\ne k} \theta_{ij} x_i x_j + \theta_{i0} x_i\Big)\Big)
\times \exp\Big(\xi_k \Big(\sum_{j \ne k} \theta_{kj} x_j + \theta_{k0}\Big) - f^*(\xi_k)\Big),
\tag{16.14}
$$

where $f^*(\xi)$ is the respective conjugate function (Problem 16.12). Note that the second exponential in
the bound can be combined with the first one and we can write

$$Z \ge \exp\big(-f^*(\xi_k) + \xi_k \theta_{k0}\big) \sum_{\boldsymbol{x} \setminus x_k} \exp\Big(-\sum_{i \ne k} \Big(\sum_{i<j\ne k} \tilde{\theta}_{ij} x_i x_j + \tilde{\theta}_{i0} x_i\Big)\Big),$$

where

$$\tilde{\theta}_{ij} = \theta_{ij}, \quad i, j \ne k,$$

and

$$\tilde{\theta}_{i0} = \theta_{i0} - \xi_k \theta_{ki}, \quad i \ne k.$$

FIGURE 16.11

(A) An MRF corresponding to a Boltzmann machine. (B) The resulting MRF after removing $x_8$ via variational
approximation. Note that if $x_5$ is removed next, then the remaining graphical model is a chain.


In other words, if from now on we replace $Z$ with the bound, it is as if node $x_k$ has been removed
and the remaining network is a Boltzmann machine with one node fewer and the respective parameters
modified, compared to the original ones. The value of $\xi_k$ can be obtained via optimization so as to
make the bound as tight as possible.

Fig. 16.11 illustrates the effect of applying variational approximation to node $x_8$. For the case of
this figure, note that if $x_5$ is removed next (after $x_8$) the remaining graphical structure becomes a chain
and exact inference can be carried out.
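
The node-removal bound of Eq. (16.14) can be checked numerically on a toy example. The sketch below is only illustrative: the four-node parameter values are hypothetical, $Z$ is computed by brute-force enumeration (feasible only for very small networks), and the tightest $\xi_k$ is found by a simple grid scan rather than by a proper optimizer. The expression used for the conjugate of $\ln(1 + \exp(-x))$, namely $f^*(\xi) = (1 + \xi)\ln(1 + \xi) - \xi \ln(-\xi)$ for $-1 < \xi < 0$, is our own derivation and should be treated as an assumption (compare with Problem 16.12).

```python
import itertools
import math

def energy(x, theta, theta0):
    """sum_i (sum_{j>i} theta_ij x_i x_j + theta_i0 x_i) for a binary vector x."""
    n = len(x)
    e = sum(theta0[i] * x[i] for i in range(n))
    e += sum(theta[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return e

def partition(theta, theta0):
    """Exact normalizing constant Z by enumerating all 2^n configurations."""
    n = len(theta0)
    return sum(math.exp(-energy(x, theta, theta0))
               for x in itertools.product((0, 1), repeat=n))

def conj(xi):
    """Assumed conjugate of ln(1 + exp(-x)), valid for -1 < xi < 0."""
    return (1.0 + xi) * math.log(1.0 + xi) - xi * math.log(-xi)

def lower_bound(theta, theta0, k, xi):
    """Bound on Z after removing node k; only the linear terms are modified."""
    keep = [i for i in range(len(theta0)) if i != k]
    red_theta = [[theta[i][j] for j in keep] for i in keep]
    red_theta0 = [theta0[i] - xi * theta[k][i] for i in keep]
    return math.exp(-conj(xi) + xi * theta0[k]) * partition(red_theta, red_theta0)

# Hypothetical symmetric couplings (zero diagonal) and linear terms.
theta = [[0.0, 0.5, -1.0, 0.0],
         [0.5, 0.0, 0.3, 0.8],
         [-1.0, 0.3, 0.0, -0.4],
         [0.0, 0.8, -0.4, 0.0]]
theta0 = [0.2, -0.3, 0.1, 0.5]

Z = partition(theta, theta0)
bounds = [lower_bound(theta, theta0, 2, -i / 200.0) for i in range(1, 200)]
best = max(bounds)  # tightest bound found on the grid
```

Every value in `bounds` stays below the exact `Z`, and the reduced model passed to `partition` is itself a Boltzmann machine with three nodes, in line with the discussion above.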

Following exactly similar arguments and employing the conjugate of the sigmoid function (Prob- 
lem 13.18), one can apply the technique to the sigmoidal Bayesian networks, discussed in Sec- 
tion 15.3.4 (see also [27]). 


16.3.2 BLOCK METHODS FOR VARIATIONAL APPROXIMATION 

In contrast to the previous approach, where the approximation is introduced on selected nodes indi- 
vidually, here the approximation is imposed on a set of nodes. Once more, a derived bound of the 
involved probability distribution is optimized. In principle, the method is equivalent to imposing a 
specific graphical structure on the nodes, which can then be addressed by tractable exact inference 
techniques. In the sequel, the family of distributions, which can be factorized over this simplified
substructure, is optimized with respect to a set of variational parameters. The method builds upon the
same arguments as those used in Section 13.2. We will retain the same notation and we will provide 
the related formulas for the discrete variable case, to be used later on in our selected examples. 

Let $\mathcal{X}$ be the set of observed and $\mathcal{X}^l$ the set of the latent random variables associated with the
nodes of a graphical structure; in the graphical model terminology we can refer to them as evidence
and hidden nodes, respectively. Define

$$\mathcal{F}(Q) = \sum_{\mathcal{X}^l} Q(\mathcal{X}^l) \ln \frac{P(\mathcal{X}, \mathcal{X}^l)}{Q(\mathcal{X}^l)}, \tag{16.15}$$

where $Q$ is any probability function. Then, from Eq. (16.15), we readily obtain

$$\mathcal{F}(Q) = \ln P(\mathcal{X}) + \sum_{\mathcal{X}^l} Q(\mathcal{X}^l) \ln \frac{P(\mathcal{X}^l \mid \mathcal{X})}{Q(\mathcal{X}^l)},$$

or

$$\ln P(\mathcal{X}) = \mathcal{F}(Q) + \sum_{\mathcal{X}^l} Q(\mathcal{X}^l) \ln \frac{Q(\mathcal{X}^l)}{P(\mathcal{X}^l \mid \mathcal{X})}. \tag{16.16}$$

Note that the second term in Eq. (16.16) is the Kullback-Leibler (KL) divergence between $P(\mathcal{X}^l \mid \mathcal{X})$
and $Q(\mathcal{X}^l)$. Because KL divergence is always nonnegative (Problem 12.7), we can write

$$\ln P(\mathcal{X}) \ge \mathcal{F}(Q),$$

and the lower bound is maximized if we minimize the KL divergence.
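
The decomposition of Eq. (16.16) is easy to verify numerically. In the following minimal sketch the table `p_joint` holds hypothetical values of $P(\mathcal{X}, \mathcal{X}^l)$ for one fixed evidence configuration and two binary latent variables, and `q` is an arbitrary distribution over the latent configurations; all names and numbers are assumptions made for illustration.

```python
import math

def free_energy(q, p_joint):
    """F(Q) = sum_h Q(h) ln [P(X, h) / Q(h)], as in Eq. (16.15)."""
    return sum(q[h] * math.log(p_joint[h] / q[h]) for h in q if q[h] > 0.0)

def kl(q, p):
    """KL divergence sum_h Q(h) ln [Q(h) / P(h)]."""
    return sum(q[h] * math.log(q[h] / p[h]) for h in q if q[h] > 0.0)

# Hypothetical values of P(X, X^l) for the observed X, indexed by the latent pair.
p_joint = {(0, 0): 0.10, (0, 1): 0.05, (1, 0): 0.02, (1, 1): 0.08}
evidence = sum(p_joint.values())                              # P(X)
posterior = {h: v / evidence for h, v in p_joint.items()}     # P(X^l | X)

# An arbitrary approximating distribution Q(X^l).
q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

gap = math.log(evidence) - free_energy(q, p_joint)            # equals kl(q, posterior)
```

Choosing `q` equal to `posterior` closes the gap, which is the sense in which minimizing the KL divergence maximizes the lower bound.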

Let us now return to our goal, i.e., given the evidence, to perform inference on the graph associated
with $P(\mathcal{X}^l \mid \mathcal{X})$. If this cannot be performed in a tractable way, the method adopts an approximation,
$Q(\mathcal{X}^l)$, of $P(\mathcal{X}^l \mid \mathcal{X})$ and at the same time imposes a specific factorization on $Q(\mathcal{X}^l)$ (which
equivalently induces a specific graphical structure) so that exact inference techniques can be employed. From
the adopted family of distributions, we choose the one that minimizes the KL divergence between
$P(\mathcal{X}^l \mid \mathcal{X})$ and $Q(\mathcal{X}^l)$. Such a choice guarantees the maximum lower bound for the log-evidence
function. Among the different ways of factorization, the so-called mean field factorization is the simplest
and, possibly, the most popular. The imposed structure on the graph has no edges, which leads to a
complete factorization of $Q(\mathcal{X}^l)$, that is,

$$Q(\mathcal{X}^l) = \prod_{i:\, x_i \in \mathcal{X}^l} Q_i(x_i): \quad \text{mean field factorization.} \tag{16.17}$$


The Mean Field Approximation and the Boltzmann Machine 

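As we already know, the joint probability for the Boltzmann machine is given by

$$P(\mathcal{X}, \mathcal{X}^l) = \frac{1}{Z} \exp\Big(-\sum_i \Big(\sum_{j>i} \theta_{ij} x_i x_j + \theta_{i0} x_i\Big)\Big),$$

where some of $x_i$ ($x_j$) belong to $\mathcal{X}$ and some to $\mathcal{X}^l$. Our first goal is to compute $P(\mathcal{X}^l \mid \mathcal{X})$ so as to
use it in the KL divergence. Note that if both $x_i, x_j \in \mathcal{X}$, their contribution results in a constant, which
is finally absorbed by the normalizing factor $Z$. If one is observed and the other one is latent, then
the product contribution becomes linear with regard to the hidden variable and it is absorbed by the
respective linear term. Then we can write

$$P(\mathcal{X}^l \mid \mathcal{X}) = \frac{1}{\tilde{Z}} \exp\Big(-\sum_{i:\, x_i \in \mathcal{X}^l} \Big(\sum_{j>i:\, x_j \in \mathcal{X}^l} \theta_{ij} x_i x_j + \tilde{\theta}_{i0} x_i\Big)\Big), \tag{16.18}$$

where $\tilde{Z}$ and $\tilde{\theta}_{i0}$ denote the normalizing constant and the linear coefficients after the above contributions
have been absorbed.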








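We now turn our attention to the form of $Q$. Due to the (assumed) binary nature of the variables, a
sensible completely factorized form of $Q$ is [34]

$$Q(\mathcal{X}^l; \boldsymbol{\mu}) = \prod_{i:\, x_i \in \mathcal{X}^l} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}, \tag{16.19}$$

where the dependence on the variational parameters, $\boldsymbol{\mu}$, is explicitly shown. Also, due to the adopted
Bernoulli distribution for each variable, $\mathbb{E}[x_i] = \mu_i$ (Chapter 2). The goal now is to optimize the KL
divergence with respect to the variational parameters. Plugging Eqs. (16.18) and (16.19) into

$$\mathrm{KL}(Q \| P) = \sum_{\mathcal{X}^l} Q(\mathcal{X}^l; \boldsymbol{\mu}) \ln \frac{Q(\mathcal{X}^l; \boldsymbol{\mu})}{P(\mathcal{X}^l \mid \mathcal{X})}, \tag{16.20}$$

we obtain (Problem 16.13, [34])

$$\mathrm{KL}(Q \| P) = \sum_i \Big[\mu_i \ln \mu_i + (1 - \mu_i) \ln(1 - \mu_i) + \sum_{j>i} \theta_{ij} \mu_i \mu_j + \tilde{\theta}_{i0} \mu_i\Big] + \ln \tilde{Z},$$

whose minimization with regard to $\mu_i$ finally results in (Problem 16.13)

$$\mu_i = \sigma\Big(-\Big(\sum_{j \ne i} \theta_{ij} \mu_j + \tilde{\theta}_{i0}\Big)\Big): \quad \text{mean field equations,} \tag{16.21}$$

where $\sigma(\cdot)$ is the sigmoid link function; recall from the definition of the Ising model that $\theta_{ij} = \theta_{ji} \ne 0$ if
$x_i$ and $x_j$ are connected and zero otherwise. Plugging the values $\mu_i$ into Eq. (16.19), an approximation
of $P(\mathcal{X}^l \mid \mathcal{X})$ in terms of $Q(\mathcal{X}^l; \boldsymbol{\mu})$ has been obtained.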

Eq. (16.21) is equivalent to a set of coupled equations known as mean field equations and they are
used in a recursive manner to compute a fixed point solution, assuming that one exists. Eq. (16.21)
is quite interesting. Although we assumed independence among hidden nodes, imposing minimization
of the KL divergence, information related to the (true) mutually dependent nature of the variables (as
this is conveyed by $P(\mathcal{X}^l \mid \mathcal{X})$) is “embedded” into the mean values with respect to $Q(\mathcal{X}^l; \boldsymbol{\mu})$; the mean
values of the respective variables are interrelated. Eq. (16.21) can also be viewed as a message passing
algorithm (see Fig. 16.12). Fig. 16.13 shows the graph associated with a Boltzmann machine prior to
and after the application of the mean field approximation.
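
The recursive use of Eq. (16.21) can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: all four nodes are treated as hidden (so no evidence has been absorbed into the linear terms), the parameter values are hypothetical, and the updates are applied one node at a time for a fixed number of sweeps. The brute-force computation of the normalizing constant inside `kl_mean_field` is included only to monitor the KL divergence on this toy problem.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_field(theta, theta0, sweeps=100):
    """Iterate the mean field equations, updating one mu_i at a time."""
    n = len(theta0)
    mu = [0.5] * n
    for _ in range(sweeps):
        for i in range(n):
            field = sum(theta[i][j] * mu[j] for j in range(n) if j != i) + theta0[i]
            mu[i] = sigmoid(-field)
    return mu

def kl_mean_field(mu, theta, theta0):
    """KL(Q||P) for the factorized Bernoulli Q; ln Z is obtained by enumeration."""
    n = len(mu)
    log_z = math.log(sum(
        math.exp(-sum(theta0[i] * x[i] for i in range(n))
                 - sum(theta[i][j] * x[i] * x[j]
                       for i in range(n) for j in range(i + 1, n)))
        for x in itertools.product((0, 1), repeat=n)))
    neg_entropy = sum(m * math.log(m) + (1.0 - m) * math.log(1.0 - m) for m in mu)
    avg_energy = sum(theta[i][j] * mu[i] * mu[j]
                     for i in range(n) for j in range(i + 1, n))
    avg_energy += sum(theta0[i] * mu[i] for i in range(n))
    return neg_entropy + avg_energy + log_z

# Hypothetical symmetric couplings (zero diagonal) and linear terms.
theta = [[0.0, 0.5, -1.0, 0.0],
         [0.5, 0.0, 0.3, 0.8],
         [-1.0, 0.3, 0.0, -0.4],
         [0.0, 0.8, -0.4, 0.0]]
theta0 = [0.2, -0.3, 0.1, 0.5]

mu = mean_field(theta, theta0)
# Largest violation of the mean field equations at the returned point.
residual = max(
    abs(mu[i] - sigmoid(-(sum(theta[i][j] * mu[j] for j in range(4) if j != i)
                          + theta0[i])))
    for i in range(4))
```

Each single-node update minimizes the KL divergence with respect to that $\mu_i$ while the others are kept fixed, so the divergence cannot increase from one update to the next.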

Note that what we have said before is nothing but an instance of the variational EM algorithm,
presented in Section 13.2; as a matter of fact, Eq. (16.21) is the outcome of the E-step for each one of
the factors $Q_i$, assuming the rest are fixed.

Thus far in the chapter, we have not mentioned the important task of how to obtain estimates of
the parameters describing a graphical structure; in our current context, these are the parameters $\theta_{ij}$
and $\theta_{i0}$, comprising the set $\boldsymbol{\theta}$. Although the parameter estimation task is discussed at the end of this
chapter, there is no harm in saying a few words at this point. Let us give the dependence on $\boldsymbol{\theta}$ explicitly
and denote the involved probabilities as $Q(\mathcal{X}^l; \boldsymbol{\mu}, \boldsymbol{\theta})$, $P(\mathcal{X}, \mathcal{X}^l; \boldsymbol{\theta})$, and $P(\mathcal{X}^l \mid \mathcal{X}; \boldsymbol{\theta})$. Treating $\boldsymbol{\theta}$ as an
unknown parameter vector, we know from the variational EM that this can be iteratively estimated by
adding the M-step in the algorithm and optimizing the lower bound with regard to $\boldsymbol{\theta}$, fixing the
rest of the involved parameters (see, for example, [29,69]).

Remarks 16.2.

• The mean field approximation method has also been applied in the case of sigmoidal neural net- 
works, defined in Section 15.3.4 (see, e.g., [69]). 

• The mean field approximation involving the completely factorized form of Q is the simplest and 
crudest approximation. More sophisticated attempts have also been suggested, where Q is allowed 
to have a richer structure while retaining its computational tractability (see, e.g., [17,31,80]). 

• In [82], the mean field approximation has been applied to a general Bayesian network, where, as we
know, the joint probability distribution is given by the product of the conditionals across the nodes,

$$p(\boldsymbol{x}) = \prod_i p(x_i \mid \mathrm{Pa}_i).$$

Unless the conditionals are given in a structurally simple form, exact message passing can become
computationally tough. For such cases, the mean field approximation can be introduced in the hidden
variables, i.e.,

$$Q(\mathcal{X}^l) = \prod_{i:\, x_i \in \mathcal{X}^l} Q_i(x_i),$$

which are then estimated so as to maximize the lower bound $\mathcal{F}(Q)$ in (16.15). Following the arguments
that were introduced in Section 13.2, this is achieved iteratively starting from some initial
estimates and at each iteration step optimization takes place with respect to a single factor, holding
the rest fixed. At the $(j + 1)$th step, the $m$th factor is obtained as (Eq. (13.15))

$$\ln Q_m^{(j+1)}(x_m) = \mathbb{E}\Big[\ln \prod_i p(x_i \mid \mathrm{Pa}_i)\Big] + \text{constant},$$

where the expectation is with respect to the currently available estimates of the factors, excluding
$Q_m$, and $x_m$ is the respective hidden variable. When restricting the conditionals within the
conjugate-exponential family, the computations of expectations of the logarithms become tractable.
The resulting scheme is equivalent to a message passing algorithm, known as variational message
passing, and it comprises passing moments and parameters associated with the exponential
distributions. An implementation example of the variational message passing scheme in the context of
MIMO-OFDM communication systems is given in [6,23,38].

FIGURE 16.12

Node $k$ is connected to $S$ nodes and receives messages from its neighbors, and then passes messages to its
neighbors.

FIGURE 16.13

(A) The nodes of the graph representing a Boltzmann machine. (B) The mean field approximation results in a graph
without edges. The dotted lines indicate the deterministic relation that is imposed among nodes, which were linked
prior to the approximation with node $x_5$.

16.3.3 LOOPY BELIEF PROPAGATION 

The message passing algorithms, which were considered previously for exact inference in graphs with 
a tree structure, can also be used for approximate inference in general graphs with cycles (loops). Such 
schemes are known as loopy belief propagation algorithms. 

The idea of using the message passing (sum-product) algorithm with graphs with cycles goes back 
to Pearl [57]. Note that algorithmically, there is nothing to prevent us from applying the algorithm in
such general structures. On the other hand, if we do it, there is no guarantee that the algorithm will 
converge in two passes and, more importantly, that it will recover the true values for the marginals. 
As a matter of fact, there is no guarantee that such a message propagation will ever converge. Thus, 
without any clear theoretical understanding, the idea of using the algorithm in general graphs was
rather forgotten. Interestingly enough, the spark for its comeback was ignited by a breakthrough in 
coding theory, under the name turbo codes [5]. It was empirically verified that the scheme can achieve 
performance very close to the theoretical Shannon limit. 

Although, in the beginning, such coding schemes seemed to be unrelated to belief propagation, it 
was subsequently shown [49] that the turbo decoding is just an instance of the sum-product algorithm, 
when applied to a graphical structure that represents the turbo code. As an example of the use of the 
loopy belief propagation for decoding, consider the case of Fig. 15.19. This is a graph with cycles. 
Applying belief propagation on this graph, we can obtain after convergence the conditional probabilities
$P(x_i \mid y_i)$ and, hence, decide on the received sequence of bits. This finding revived interest in loopy
belief propagation; after all, it “may be useful” in practice. Moreover, it initiated activity in theoretical
research in order to understand its performance as well as its more general convergence properties. 

In [83], it is shown, on the basis of pair-wise connected MRFs (undirected graphical models with 
potential functions involving at most pairs of variables, e.g., trees), that whenever the sum-product 
algorithm converges on loopy graphs, the fixed points of the message passing algorithm are actually 
stationary points of the so-called Bethe free energy cost. This is directly related to the KL divergence
between the true and an approximating distribution (Section 12.7). Recall from Eq. (16.20) that one of 
the terms in KL divergence is the negative entropy associated with Q. In the mean field approximation, 
this entropy term can be easily computed. However, this is not the case for more general structures 
with cycles and one has to be content to settle for an approximation. The so-called Bethe entropy 
approximation is employed, which in turn gives rise to the Bethe free energy cost function. To obtain 
the Bethe entropy approximation, one “embeds” into the approximating distribution, $Q$, a structure
that is in line with (16.8), which holds true for trees. Indeed, it can be checked (try it) that for (singly
connected) trees, the product in the numerator runs over all pairs of connected nodes in the tree; let us
denote it as $\prod_{(i,j)} P_{ij}(x_i, x_j)$. Also, $d_s$ is equal to the number of nodes that node $s$ is connected with.
Thus, we can write the joint probability as

$$P(\boldsymbol{x}) = \frac{\prod_{(i,j)} P_{ij}(x_i, x_j)}{\prod_s [P_s(x_s)]^{d_s - 1}}.$$

Note that nodes that are connected to only one node have no contribution in the denominator. Then the
entropy of the tree, that is,

$$E = -\mathbb{E}[\ln P(\boldsymbol{x})],$$

can be written as

$$E = -\sum_{(i,j)} \sum_{x_i} \sum_{x_j} P_{ij}(x_i, x_j) \ln P_{ij}(x_i, x_j) + \sum_s (d_s - 1) \sum_{x_s} P_s(x_s) \ln P_s(x_s). \tag{16.22}$$


Thus, this expression for the entropy is exact for trees. However, for more general graphs with cycles, 
this can only hold approximately true, and it is known as the Bethe approximation of the entropy. 
The closer to a tree a graph is, the better the approximation becomes; see [83] for a concise, related 
introduction. 
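
That the entropy expression of Eq. (16.22) is exact for trees can be confirmed on the smallest nontrivial case, a three-node chain. In the sketch below the conditional probability tables are hypothetical, the pairwise and single-node marginals are obtained by direct summation of the joint, and the helper names are ours.

```python
import itertools
import math

# Hypothetical chain x_0 - x_1 - x_2 with binary variables.
p0 = [0.6, 0.4]                       # P(x_0)
p1_given_0 = [[0.7, 0.3], [0.2, 0.8]]  # P(x_1 | x_0), rows indexed by x_0
p2_given_1 = [[0.9, 0.1], [0.5, 0.5]]  # P(x_2 | x_1), rows indexed by x_1

joint = {(a, b, c): p0[a] * p1_given_0[a][b] * p2_given_1[b][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def marginal(table, keep):
    """Marginal of the joint over the variables whose indices are in `keep`."""
    out = {}
    for x, p in table.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def plogp(table):
    """sum p ln p over the entries of a probability table."""
    return sum(p * math.log(p) for p in table.values() if p > 0.0)

edges = [(0, 1), (1, 2)]
degree = [1, 2, 1]        # d_s: number of neighbors of each node

exact_entropy = -plogp(joint)
bethe_entropy = (-sum(plogp(marginal(joint, e)) for e in edges)
                 + sum((degree[s] - 1) * plogp(marginal(joint, (s,)))
                       for s in range(3)))
```

Only the middle node, with $d_s = 2$, contributes to the second sum; the two leaves drop out, as noted in the text. Adding an edge between the end nodes would create a cycle and the two quantities would, in general, no longer coincide.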

It turns out that in the case of trees, the sum-product algorithm leads to the true marginal values 
because no approximation is involved and minimizing the free energy is equivalent to minimizing the 
KL divergence. Thus, from this perspective, the sum-product algorithm gets an optimization flavor. 
In a number of practical cases, the Bethe approximation is accurate enough, which justifies the good 
performance that is often achieved in practice by the loopy belief algorithm (see, e.g., [52]). The loopy 
belief propagation algorithm is not guaranteed to converge in graphs with cycles, so one may choose 
to minimize the Bethe energy cost directly; although such schemes are slower compared to message 
passing, they are guaranteed to converge (e.g., [85]). 

An alternative interpretation of the sum-product algorithm as an optimization algorithm of an ap- 
propriately selected cost function is given in [75,77]. A unifying framework for exact, as well as 
approximate, inference is provided in the context of the exponential family of distributions. Both the
mean field approximation and the loopy belief propagation algorithm are considered and viewed as different
ways to approximate a convex set of realizable mean parameters, which are associated with the
corresponding distribution. Although we will not proceed in a detailed presentation, we will provide 
a few “brush strokes,” which are indicative of the main points around which this theory develops. At 
the same time, this is a good excuse for us to be exposed to an interesting interplay among the notions 
of convex duality, entropy, cumulant generating function, and mean parameters, in the context of the 
exponential family. 

The general form of a probability distribution in the exponential family is given by (Section 12.8.1)

$$
\begin{aligned}
p(\boldsymbol{x}; \boldsymbol{\theta}) &= C \exp\Big(\sum_{i \in I} \theta_i u_i(\boldsymbol{x})\Big) \\
&= \exp\big(\boldsymbol{\theta}^T \boldsymbol{u}(\boldsymbol{x}) - A(\boldsymbol{\theta})\big),
\end{aligned}
$$

with

$$A(\boldsymbol{\theta}) = -\ln C = \ln \int \exp\big(\boldsymbol{\theta}^T \boldsymbol{u}(\boldsymbol{x})\big)\, d\boldsymbol{x},$$

where the integral becomes summation for discrete variables; $A(\boldsymbol{\theta})$ is a convex function and it is known
as the log-partition or cumulant generating function (Problem 16.14, [75,77]). It turns out that the
conjugate function of $A(\boldsymbol{\theta})$, denoted as $A^*(\boldsymbol{\mu})$, is the negative entropy function of $p(\boldsymbol{x}; \boldsymbol{\theta}(\boldsymbol{\mu}))$, where
$\boldsymbol{\theta}(\boldsymbol{\mu})$ is the value of $\boldsymbol{\theta}$ where the maximum (in the definition of the conjugate function) occurs given
the value of $\boldsymbol{\mu}$; we say that $\boldsymbol{\theta}(\boldsymbol{\mu})$ and $\boldsymbol{\mu}$ are dually coupled (Problem 16.15). Moreover,

$$\mathbb{E}[\boldsymbol{u}(\boldsymbol{x})] = \boldsymbol{\mu},$$

where the expectation is with respect to $p(\boldsymbol{x}; \boldsymbol{\theta}(\boldsymbol{\mu}))$. This is an interesting interpretation of $\boldsymbol{\mu}$ as a
mean parameter vector; recall from Section 12.8.1 that these mean parameters define the respective
exponential distribution. Then

$$A(\boldsymbol{\theta}) = \max_{\boldsymbol{\mu} \in \mathcal{M}} \big(\boldsymbol{\theta}^T \boldsymbol{\mu} - A^*(\boldsymbol{\mu})\big), \tag{16.23}$$

where $\mathcal{M}$ is the set that guarantees that $A^*(\boldsymbol{\mu})$ is finite, according to the definition of the conjugate
function in Eq. (13.78). It turns out that in graphs of a tree structure, the sum-product algorithm is an
iterative scheme of solving a Lagrangian dual formulation of Eq. (16.23) [75,77]. Moreover, in this
case, the set $\mathcal{M}$, which can be shown to be a convex one, can be characterized explicitly in
a straightforward way and the negative entropy $A^*(\boldsymbol{\mu})$ has an explicit form. These properties are no
longer valid in graphs with cycles. The mean field approximation involves an inner approximation of the
set $\mathcal{M}$; hence, it restricts optimization to a limited class of distributions, for which the entropy can be
recovered exactly. On the other hand, the loopy belief algorithm provides an outer approximation and,
hence, enlarges the class of distributions; entropy can only approximately be recovered, which for the
case of pair-wise MRFs can take the form of the Bethe approximation.
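
The duality in Eq. (16.23) can be illustrated on the simplest member of the exponential family. The sketch below assumes a single Bernoulli variable with $u(x) = x$, $x \in \{0, 1\}$, for which $A(\theta) = \ln(1 + e^{\theta})$ and the negative entropy is $A^*(\mu) = \mu \ln \mu + (1 - \mu)\ln(1 - \mu)$ with $\mathcal{M} = (0, 1)$. This special case and the grid search over $\mu$ are our own choices for illustration and do not appear in [75,77] in this form.

```python
import math

def log_partition(theta):
    """A(theta) = ln(1 + exp(theta)) for a Bernoulli variable with u(x) = x."""
    return math.log(1.0 + math.exp(theta))

def neg_entropy(mu):
    """A*(mu): negative entropy of a Bernoulli distribution with mean mu."""
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

def dual_value(theta, grid=10000):
    """max over mu in (0, 1) of theta * mu - A*(mu), evaluated on a uniform grid."""
    return max(theta * (i / grid) - neg_entropy(i / grid) for i in range(1, grid))

theta = 0.7                                  # hypothetical natural parameter
mu_star = 1.0 / (1.0 + math.exp(-theta))     # mean parameter coupled with theta
exact = log_partition(theta)
via_dual = dual_value(theta)
```

The maximizing $\mu$ is the mean $\mathbb{E}[x] = \sigma(\theta)$, which is the dual coupling between $\theta$ and $\mu$ referred to above.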

The previously summarized theoretical findings have been generalized to the case of junction trees, 
where the potential functions involve more than two variables. Such methods involve the so-called 
Kikuchi energy, which is a generalization of the Bethe approximation [77,84]. Such arguments have
their origins in statistical physics [37].

Remarks 16.3.

• Following the success of loopy belief propagation in turbo decoding, further research verified its per- 
formance potential in a number of tasks, such as low-density parity-check codes [15,47], network 
diagnostics [48], sensor network applications [24], and multiuser communications [70]. Further-
more, a number of modified versions of the basic scheme have been proposed. In [74], the so-called 
tree-reweighted belief propagation is proposed. In [26], arguments from information geometry are 
employed and in [78], projection arguments in the context of information geometry are used. More 
recently, the belief propagation algorithm and the mean field approximation were proposed to be 
optimally combined to exploit their respective advantages [65]. A related review can be found in 
[76]. In a nutshell, this old scheme is still alive and kicking!

• In Section 13.10, the expectation propagation algorithm was discussed in the context of parameter 
inference. The scheme can also be adopted in the more general framework of graphical models, if 
the place of parameters is taken by the hidden variables. Graphical models are particularly tailored 
for this approach because the joint PDF is factorized. If the approximate PDF is
completely factorized, corresponding to a partially disconnected network, the expectation propagation
algorithm turns out to be the loopy belief propagation algorithm [50]. In [51], it is shown that
a new family of message passing algorithms can be obtained by utilizing a generalization of the 
KL divergence as the optimizing cost. This family encompasses a number of previously developed 
schemes. 

• Besides the approximation techniques that were previously presented, another popular pool of meth- 
ods is the Markov chain Monte Carlo (MCMC) framework. Such techniques were discussed in 
Chapter 14 (see, for example, [25] and the references therein). 


16.4 DYNAMIC GRAPHICAL MODELS 

All the graphical models that have been discussed so far were developed to serve the needs of random 
variables whose statistical properties remained fixed over time. However, this is not always the case. 
As a matter of fact, the terms time adaptivity and time variation are central for most parts of this book.
Our focus in this section is to deal with random variables whose statistical properties are not fixed 
but are allowed to undergo changes. A number of time series as well as sequentially obtained data 
fall under this setting with applications ranging from signal processing and robotics to finance and 
bioinformatics. 

A key difference here, compared to what we have discussed in the previous sections of this chapter, 
is that now observations are sensed sequentially and the specific sequence in which they occur carries 
important information, which has to be respected and exploited in any subsequent inference task. For 
example, in speech recognition, the sequence in which the feature vectors result is very important. In
a typical speech recognition task, the raw speech data are sequentially segmented in short (usually
overlapping) time windows and from each window a feature vector is obtained (e.g., DFT of the samples
in the respective time slot). This is illustrated in Fig. 16.14. These feature vectors constitute the
observation sequence. Besides the information that resides in the specific values of these observation
vectors, the sequence in which the observations appear discloses important information about the word
that is spoken; our language and spoken words are highly structured human activities. Similar arguments
hold true for applications such as learning and reasoning concerning biological molecules, for
example, DNA and proteins.

FIGURE 16.14

A speech segment and $N$ time windows, each one of length equal to 500 ms. They correspond to time intervals
[0, 500], [500, 1000], and [3500, 4000], respectively. From each one of them, a feature vector, $\boldsymbol{y}$, is generated. In
practice, an overlap between successive windows is allowed.

Although any type of graphical model has its dynamic counterpart, we will focus on the family of 
dynamic Bayesian networks and, in particular, a specific type known as hidden Markov models. 

A very popular and effective framework to model sequential data is via the so-called state-observation
or state-space models. Each set of random variables, $\boldsymbol{y}_n \in \mathbb{R}^l$, which are observed at time $n$,
is associated with a corresponding hidden/latent random vector $\boldsymbol{x}_n$ (not necessarily of the same
dimensionality as that of the observations). The system dynamics are modeled via the latent variables and
observations are considered to be the output of a measuring noisy sensing device. The so-called latent
Markov models are built around the following two independence assumptions:

$$(1) \quad \boldsymbol{x}_{n+1} \perp (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{n-1}) \mid \boldsymbol{x}_n, \tag{16.24}$$

$$(2) \quad \boldsymbol{y}_n \perp (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{n-1}, \boldsymbol{x}_{n+1}, \ldots, \boldsymbol{x}_N) \mid \boldsymbol{x}_n, \tag{16.25}$$

where $N$ is the total number of observations. The first condition defines the system dynamics via the
transition model

$$p(\boldsymbol{x}_{n+1} \mid \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = p(\boldsymbol{x}_{n+1} \mid \boldsymbol{x}_n), \tag{16.26}$$

and the second one via the observation model

$$p(\boldsymbol{y}_n \mid \boldsymbol{x}_1, \ldots, \boldsymbol{x}_N) = p(\boldsymbol{y}_n \mid \boldsymbol{x}_n). \tag{16.27}$$

In words, the future is independent of the past given the present, and the observations are independent
of the future and past given the present.

FIGURE 16.15

The Bayesian network corresponding to a latent Markov model. If latent variables are of a discrete nature, this
corresponds to an HMM. If both observed and latent variables are continuous and follow a Gaussian distribution, this
corresponds to a linear dynamic system (LDS). Note that the observed variables comprise the leaves of the graph.

The previously stated independencies are graphically represented via the graph of Fig. 16.15. If the
hidden variables are of a discrete nature, the resulting model is known as a hidden Markov model. If, on
the other hand, both hidden and observation variables are of a continuous nature, the resulting model
gets rather involved to deal with. However, analytically tractable tools can be and have been developed
for some special cases. In the so-called linear dynamic systems (LDSs), the system dynamics and the
generation of the observations are modeled as

$$\boldsymbol{x}_n = F_n \boldsymbol{x}_{n-1} + \boldsymbol{\eta}_n, \tag{16.28}$$

$$\boldsymbol{y}_n = H_n \boldsymbol{x}_n + \boldsymbol{v}_n, \tag{16.29}$$

where $\boldsymbol{\eta}_n$ and $\boldsymbol{v}_n$ are zero mean, mutually independent noise disturbances modeled by Gaussian
distributions. This is the setting of the celebrated Kalman filter, which we have already discussed in Chapter 4 and it will
also be considered, from a probabilistic perspective, in Chapter 17. The probabilistic counterparts of
Eqs. (16.28) and (16.29) are

$$p(\boldsymbol{x}_n \mid \boldsymbol{x}_{n-1}) = \mathcal{N}(\boldsymbol{x}_n \mid F_n \boldsymbol{x}_{n-1}, Q_n), \tag{16.30}$$

$$p(\boldsymbol{y}_n \mid \boldsymbol{x}_n) = \mathcal{N}(\boldsymbol{y}_n \mid H_n \boldsymbol{x}_n, R_n), \tag{16.31}$$

where $Q_n$ and $R_n$ are the covariance matrices of $\boldsymbol{\eta}_n$ and $\boldsymbol{v}_n$, respectively.

16.5 HIDDEN MARKOV MODELS 

Hidden Markov models are represented by the graphical model in Fig. 16.15 and Eqs. (16.26) and
(16.27). The latent variables are discrete; hence, we write the transition probability as $P(x_n \mid x_{n-1})$ and
this corresponds to a table of probabilities. Observation variables can either be discrete or continuous.












FIGURE 16.16 

The unfolding in time of a trajectory that associates observations with states.


Basically, an HMM is used to model a quasistationary process that undergoes sudden changes among 
a number of, say, K subprocesses. Each one of these subprocesses is described by different statistical 
properties. One could alternatively view it as a combined system comprising a number of subsystems; 
each one of these subsystems generates data/observations according to a different statistical model; 
for example, one may follow a Gaussian and the other one a Student’s t distribution. Observations 
are emitted by these subsystems; however, once an observation is received, we do not know which 
subsystem this was emitted from. This reminds us of the mixture modeling task of a PDF; however, in 
mixture modeling, we did not care about the sequence in which observations occur. 

For modeling purposes, we associate with each observation $\boldsymbol{y}_n$ a hidden variable, $k_n = 1, 2, \ldots, K$,
which is the (random) index indicating the subsystem/subprocess that generated the respective observation
vector. We will call it the state. Each $k_n$ corresponds to $\boldsymbol{x}_n$ of the general model. The sequence
of the complete observation set $(\boldsymbol{y}_n, k_n)$, $n = 1, 2, \ldots, N$, forms a trajectory in a two-dimensional grid,
having the states on one axis and the observations on the other. This is shown in Fig. 16.16 for $K = 3$.
Such a path reveals the origin of each observation; $\boldsymbol{y}_1$ was emitted from state $k_1 = 1$, $\boldsymbol{y}_2$ from $k_2 = 2$,
$\boldsymbol{y}_3$ from $k_3 = 2$, and $\boldsymbol{y}_N$ from $k_N = 3$. Note that each trajectory is associated with a probability distribution,
that is, the joint distribution of the complete set. Indeed, the probability that the trajectory of Fig. 16.16
will occur depends on the value of $P\big((\boldsymbol{y}_1, k_1 = 1), (\boldsymbol{y}_2, k_2 = 2), (\boldsymbol{y}_3, k_3 = 2), \ldots, (\boldsymbol{y}_N, k_N = 3)\big)$. We
will soon see that some of the possible trajectories that can be drawn in the grid are not allowed in practice;
this may be due to physical constraints concerning the data generation mechanism that underlies
the corresponding system/process.

Transition probabilities. As already said, the dynamics of a latent Markov model are described in
terms of the distribution $p(\boldsymbol{x}_n \mid \boldsymbol{x}_{n-1})$, which for an HMM becomes the set of probabilities

$$P(k_n \mid k_{n-1}), \quad k_n, k_{n-1} = 1, 2, \ldots, K,$$

indicating the probability of the system to “jump” at time $n$ to state $k_n$ from state $k_{n-1}$, where it was at
time $n - 1$. In general, this table of probabilities may be time-varying. In the standard form of HMMs,
this is considered to be independent of time and we say that our model is homogeneous. Thus, we can
write

$$P(k_n \mid k_{n-1}) = P(i \mid j) := P_{ij}, \quad i, j = 1, 2, \ldots, K.$$

FIGURE 16.17

(A) A three-state left-to-right HMM model. (B) Each state is characterized by different statistical properties.


Note that some of these transition probabilities can be zero, depending on the modeling assump- 
tions. Fig. 16.17A shows an example of a three-state system. The model is of the so-called left-to-right 
type, where two types of transitions are allowed: (a) self transitions and (b) transitions from a state of a 
lower index to a state of a higher index. The system, once it jumps into a state k, emits data according 
to a probability distribution p{y\k), as illustrated in Fig. 16.17B. Besides the left-to-right models, other 
alternatives have also been proposed [8,63]. The States correspond to certain physical characteristics 
of the corresponding system. For example, in speech recognition, the number of States that are chosen 
to model a spoken word depends on the expected number of sound phenomena (phonemes) within the 



FIGURE 16.18 

The black trajectory is not allowed to occur under the HMM model of Fig. 16.17. Transitions from state k = 3 to 
state k = 2 and from k = 3 to k = 1 are not permitted. In contrast, the state unfolding in the red curve is in agree- 
ment with the model. 
word. Typically, three to four states are used per phoneme. Another modeling path uses the average
number of observations resulting from various versions of a spoken word as an indication of the number
of states. Seen from the transition probabilities perspective, an HMM is basically a stochastic finite
state automaton that generates an observation string. Note that the semantics of Fig. 16.17 is different
from, and must not be confused with, the graphical structure given in Fig. 16.15. Fig. 16.17 is a graphical
interpretation of the transition probabilities among the states; it says nothing about independencies
among the involved random variables. Once a state transition model has been adopted, some trajectories
in the trellis diagram of Fig. 16.16 will not be allowed. In Fig. 16.18, the black trajectory is not in
line with the model of Fig. 16.17.
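
As a rough illustration of the "stochastic finite state automaton" view, the following Python sketch samples a state trajectory and an observation string from a three-state left-to-right model. The numerical values, the discrete emission table `B`, and the names `draw` and `sample_hmm` are assumptions made for this example only; they are not taken from Fig. 16.17.

```python
import random

# Convention assumed here (as in the text): P[i][j] is the probability of
# jumping from state j to state i, so every column sums to one.
# Left-to-right model: P[i][j] = 0 whenever i < j (no jumps to a lower index).
P = [
    [0.6, 0.0, 0.0],
    [0.3, 0.7, 0.0],
    [0.1, 0.3, 1.0],
]
P_init = [1.0, 0.0, 0.0]  # hypothetical: always start at the first state
# Hypothetical discrete emissions: B[k][r] = probability of symbol r in state k.
B = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]


def draw(probs, rng):
    """Draw an index according to the probabilities in probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_hmm(P_init, P, B, N, rng):
    """Generate N (state, observation) pairs from the model."""
    K = len(P_init)
    states, obs = [], []
    k = draw(P_init, rng)
    for _ in range(N):
        states.append(k)
        obs.append(draw(B[k], rng))
        # Column k of P is the distribution of the next state given state k.
        k = draw([P[i][k] for i in range(K)], rng)
    return states, obs


rng = random.Random(0)
states, obs = sample_hmm(P_init, P, B, 12, rng)
print("states:", states)
print("observations:", obs)
```

Because every entry of `P` above the diagonal is zero, any sampled state sequence is nondecreasing, which is exactly the restriction on trellis trajectories discussed above.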

16.5.1 INFERENCE 

As in any graphical modeling task, the ultimate goal is inference. Two types of inference are of particular
interest in the context of classification/recognition. Let us discuss them in the framework of speech
recognition; similar arguments hold true for other applications. We are given a set of (output variables)
observations, $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_N$, and we have to decide to which spoken word these correspond. In the
database, each spoken word is represented by an HMM model, which is the result of extensive training.
An HMM model is fully described by the following set of parameters:

HMM model parameters 

1. Number of states $K$.

2. The probabilities for the initial state at $n = 1$ to be at state $k$, that is, $P_k$, $k = 1, 2, \ldots, K$.

3. The set of transition probabilities $P_{ij}$, $i, j = 1, 2, \ldots, K$.

4. The state emission distributions $p(\boldsymbol{y}|k)$, $k = 1, 2, \ldots, K$, which can be either discrete or continuous.
Often, these probability distributions may be parameterized, $p(\boldsymbol{y}|k; \boldsymbol{\theta}_k)$, $k = 1, 2, \ldots, K$.

Prior to inference, all the involved parameters are assumed to be known. Learning of the HMM param¬ 
eters takes place in the training phase; we will come to it shortly. 

For the recognition, a number of scores can be used. Here we will discuss two alternatives that 
come as a direct consequence of our graphical modeling approach. For a more detailed discussion, see, 
for example, [72]. 

In the first one, the joint distribution for the observed sequence is computed, after marginalizing out
all hidden variables; this is done for each one of the models/words. Then the word that scores the largest
value is selected. This method corresponds to the sum-product rule. The other path is to compute, for
each model/word, the optimal trajectory in the trellis diagram; that is, the trajectory that scores the
highest joint probability. Then we decide in favor of the model/word that corresponds to the
largest optimal value. This method is an implementation of the max-sum rule.

The Sum-Product Algorithm: the HMM Case 

The first step is to transform the directed graph of Fig. 16.15 to an undirected one; a factor graph or 
a junction tree graph. Note that this is trivial for this case, as the graph is already a tree. Let us work 
with the junction tree formulation. Also, in order to use the message passing formulas of (16.4) and 
(16.5) as well as (16.6) and (16.7) for computing the distribution values, we will first adopt a more 
compact way of representing the conditional probabilities. We will employ the technique that was used 
in Section 13.4 for the mixture modeling case. Let us denote each latent variable as a $K$-dimensional
vector, $\boldsymbol{x}_n \in \mathbb{R}^K$, $n = 1, 2, \ldots, N$, whose elements are all zero except at the $k$th location, where $k$ is the
index of the (unknown) state from which $\boldsymbol{y}_n$ has been emitted, that is,

$$\boldsymbol{x}_n^T = [x_{n,1}, x_{n,2}, \ldots, x_{n,K}]: \quad \begin{cases} x_{n,i} = 0, & i \neq k, \\ x_{n,k} = 1. \end{cases}$$


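Then we can compactly write

$$P(\boldsymbol{x}_1) = \prod_{k=1}^{K} P_k^{x_{1,k}}, \tag{16.32}$$

$$P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1}) = \prod_{i=1}^{K}\prod_{j=1}^{K} P_{ij}^{x_{n-1,j}\,x_{n,i}}. \tag{16.33}$$

Indeed, if the jump is from a specific state $j$ at time $n-1$ to a specific state $i$ at time $n$, then the
only term that survives in the previous product is the corresponding factor, $P_{ij}$. The joint probability
distribution of the complete set, as a direct consequence of the Bayesian network model of Fig. 16.15,
is written as

$$p(Y, X) = P(\boldsymbol{x}_1)\,p(\boldsymbol{y}_1|\boldsymbol{x}_1)\prod_{n=2}^{N} P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n), \tag{16.34}$$

where

$$p(\boldsymbol{y}_n|\boldsymbol{x}_n) = \prod_{k=1}^{K}\big(p(\boldsymbol{y}_n|k; \boldsymbol{\theta}_k)\big)^{x_{n,k}}. \tag{16.35}$$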
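
To make the factorization in (16.34) concrete, here is a minimal Python sketch that evaluates $\ln p(Y, X)$ for one given state path. It assumes discrete emissions stored in a table `B[k][r]` and 0-based state and symbol indices; these choices, the function names, and the numbers are illustrative assumptions. Selecting the entries `P[i][j]` and `B[i][y]` along the path plays the role of the one-hot exponents in (16.32), (16.33), and (16.35).

```python
import math


def safe_log(p):
    """Natural logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def log_joint(P_init, P, B, states, obs):
    """ln p(Y, X) of (16.34) for a state path and discrete observations.

    P[i][j] is the probability of a transition from state j to state i.
    """
    logp = safe_log(P_init[states[0]]) + safe_log(B[states[0]][obs[0]])
    for n in range(1, len(states)):
        j, i = states[n - 1], states[n]
        logp += safe_log(P[i][j]) + safe_log(B[i][obs[n]])
    return logp


# Hypothetical two-state model.
P_init = [0.5, 0.5]
P = [[0.9, 0.2], [0.1, 0.8]]   # columns sum to one
B = [[0.7, 0.3], [0.4, 0.6]]   # rows sum to one

value = log_joint(P_init, P, B, states=[0, 0, 1], obs=[0, 1, 1])
print(math.exp(value))  # 0.5*0.7 * 0.9*0.3 * 0.1*0.6
```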


The corresponding junction tree is trivially obtained from the graph in Fig. 16.15. Replacing directed
links with undirected ones and considering cliques of size two, the graph in Fig. 16.19A results.
However, as all the $\boldsymbol{y}_n$ variables are observed (instantiated) and no marginalization is required, their



FIGURE 16.19 

(A) The junction tree that results from the graph of Fig. 16.15. (B) Because y n are observed, their effect is only of a 
multiplicative nature (no marginalization is involved) and its contribution can be trivially absorbed by the potential 
functions (distributions) associated with the latent variables. 













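multiplicative contribution can be absorbed by the respective conditional probabilities, which leads to
the graph of Fig. 16.19B. Alternatively, this junction tree can be obtained if one considers the nodes
$(\boldsymbol{x}_1, \boldsymbol{y}_1)$ and $(\boldsymbol{x}_{n-1}, \boldsymbol{x}_n, \boldsymbol{y}_n)$, $n = 2, 3, \ldots, N$, to form cliques associated with the potential functions

$$\psi_1(\boldsymbol{x}_1, \boldsymbol{y}_1) = P(\boldsymbol{x}_1)\,p(\boldsymbol{y}_1|\boldsymbol{x}_1) \tag{16.36}$$

and

$$\psi_n(\boldsymbol{x}_{n-1}, \boldsymbol{y}_n, \boldsymbol{x}_n) = P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n), \quad n = 2, \ldots, N. \tag{16.37}$$

The junction tree of Fig. 16.19B results by eliminating nodes from the cliques starting from $\boldsymbol{x}_1$. Note
that the normalizing constant is equal to one, $Z = 1$.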

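To apply the sum-product rule for junction trees, Eq. (16.5) now becomes

$$\mu_{V_n \to S_n}(\boldsymbol{x}_n) = \sum_{\boldsymbol{x}_{n-1}} \psi_n(\boldsymbol{x}_{n-1}, \boldsymbol{y}_n, \boldsymbol{x}_n)\,\mu_{S_{n-1} \to V_n}(\boldsymbol{x}_{n-1}) = \sum_{\boldsymbol{x}_{n-1}} \mu_{S_{n-1} \to V_n}(\boldsymbol{x}_{n-1})\,P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n).$$

Also,

$$\mu_{S_{n-1} \to V_n}(\boldsymbol{x}_{n-1}) = \mu_{V_{n-1} \to S_{n-1}}(\boldsymbol{x}_{n-1}). \tag{16.38}$$

Thus,

$$\mu_{V_n \to S_n}(\boldsymbol{x}_n) = \sum_{\boldsymbol{x}_{n-1}} \mu_{V_{n-1} \to S_{n-1}}(\boldsymbol{x}_{n-1})\,P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n), \tag{16.39}$$

with

$$\mu_{V_1 \to S_1}(\boldsymbol{x}_1) = P(\boldsymbol{x}_1)\,p(\boldsymbol{y}_1|\boldsymbol{x}_1). \tag{16.40}$$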

In the HMM literature, it is common to use the "alpha" symbol for the exchanged messages, that is,

$$\alpha(\boldsymbol{x}_n) := \mu_{V_n \to S_n}(\boldsymbol{x}_n). \tag{16.41}$$

If one considers that the message passing terminates at a node $V_n$, then based on (16.7), and taking into
account that the variables $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$ are clamped to the observed values (recall the related comment
following Eq. (15.44)), it is readily seen that

$$\alpha(\boldsymbol{x}_n) = p(\boldsymbol{y}_1, \boldsymbol{y}_2, \ldots, \boldsymbol{y}_n, \boldsymbol{x}_n), \tag{16.42}$$

which can also be deduced by the respective definitions in (16.39) and (16.40); all hidden variables,
except $\boldsymbol{x}_n$, have been marginalized out. This is a set of $K$ probability values (one for each value of $\boldsymbol{x}_n$).
For example, for $\boldsymbol{x}_n: x_{n,k} = 1$, $\alpha(\boldsymbol{x}_n)$ is the probability of the trajectory to be at time $n$ at state $k$ and
having obtained the specific observations up to and including time $n$. From (16.42), one can readily
obtain the joint probability distribution (evidence) over the observation sequence, comprising $N$ time
instants, that is,

$$p(Y) = \sum_{\boldsymbol{x}_N} p(\boldsymbol{y}_1, \boldsymbol{y}_2, \ldots, \boldsymbol{y}_N, \boldsymbol{x}_N) = \sum_{\boldsymbol{x}_N} \alpha(\boldsymbol{x}_N): \quad \text{evidence of observations},$$

which, as said in the beginning of the section, is a quantity used for classification/recognition.

In the signal processing "jargon," the computation of $\alpha(\boldsymbol{x}_n)$ is referred to as the filtering recursion.
By the definition of $\alpha(\boldsymbol{x}_n)$, we have [2]


$$\alpha(\boldsymbol{x}_n) = \underbrace{p(\boldsymbol{y}_n|\boldsymbol{x}_n)}_{\text{corrector}} \cdot \underbrace{\sum_{\boldsymbol{x}_{n-1}} \alpha(\boldsymbol{x}_{n-1})\,P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})}_{\text{predictor}}: \quad \text{filtering recursion}. \tag{16.43}$$

The same structure is met in the Kalman filter, to be treated in Chapter 17; the only difference there is that
the summation is replaced by integration. Having adopted Gaussian distributions, these integrations
translate into updates of the respective mean values and covariance matrices. The physical meaning of
(16.43) is that the predictor provides a prediction on the state using all the past information prior to $n$.
Then this information is corrected based on the observation $\boldsymbol{y}_n$, which is received at time $n$. Thus, the
updated information, based on the entire observation sequence up to and including the current time $n$,
is readily available by

$$P(\boldsymbol{x}_n|Y_{[1:n]}) = \frac{\alpha(\boldsymbol{x}_n)}{p(Y_{[1:n]})},$$

where the denominator is given by $\sum_{\boldsymbol{x}_n} \alpha(\boldsymbol{x}_n)$, and $Y_{[1:n]} := (\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n)$.
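
The filtering recursion (16.43) can be sketched in a few lines of Python. The example below assumes discrete emissions `B[k][r]`, 0-based indices, and a small hypothetical two-state model; `evidence_brute_force` is included only to check the recursion against the definition of the evidence as a sum over all $K^N$ trajectories.

```python
from itertools import product


def forward(P_init, P, B, obs):
    """Return alpha[n][k] = p(y_1, ..., y_n, state k at time n).

    P[i][j] is the probability of a transition from state j to state i.
    """
    K = len(P_init)
    alpha = [[P_init[k] * B[k][obs[0]] for k in range(K)]]  # (16.40)
    for y in obs[1:]:
        prev = alpha[-1]
        # Predictor: propagate the previous message through the transitions.
        predictor = [sum(P[i][j] * prev[j] for j in range(K)) for i in range(K)]
        # Corrector: weight by the emission probability of the new observation.
        alpha.append([B[i][y] * predictor[i] for i in range(K)])
    return alpha


def evidence_brute_force(P_init, P, B, obs):
    """Sum the joint probability (16.34) over every possible state path."""
    K, total = len(P_init), 0.0
    for path in product(range(K), repeat=len(obs)):
        p = P_init[path[0]] * B[path[0]][obs[0]]
        for n in range(1, len(obs)):
            p *= P[path[n]][path[n - 1]] * B[path[n]][obs[n]]
        total += p
    return total


P_init = [0.5, 0.5]
P = [[0.9, 0.2], [0.1, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]
obs = [0, 1, 1, 0]

alpha = forward(P_init, P, B, obs)
evidence = sum(alpha[-1])                      # p(Y)
filtered = [a / evidence for a in alpha[-1]]   # P(x_N | Y_[1:N])
print(evidence, filtered)
```

The recursion costs on the order of $K^2$ operations per time instant, whereas the brute-force sum grows as $K^N$.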

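Let us now carry on with the second message passing phase, in the opposite direction than before,
in order to obtain

$$\mu_{V_{n+1} \to S_n}(\boldsymbol{x}_n) = \sum_{\boldsymbol{x}_{n+1}} \mu_{S_{n+1} \to V_{n+1}}(\boldsymbol{x}_{n+1})\,P(\boldsymbol{x}_{n+1}|\boldsymbol{x}_n)\,p(\boldsymbol{y}_{n+1}|\boldsymbol{x}_{n+1}),$$

$$\mu_{S_{n+1} \to V_{n+1}}(\boldsymbol{x}_{n+1}) = \mu_{V_{n+2} \to S_{n+1}}(\boldsymbol{x}_{n+1}).$$

Hence,

$$\mu_{V_{n+1} \to S_n}(\boldsymbol{x}_n) = \sum_{\boldsymbol{x}_{n+1}} \mu_{V_{n+2} \to S_{n+1}}(\boldsymbol{x}_{n+1})\,P(\boldsymbol{x}_{n+1}|\boldsymbol{x}_n)\,p(\boldsymbol{y}_{n+1}|\boldsymbol{x}_{n+1}), \tag{16.44}$$

with

$$\mu_{V_{N+1} \to S_N}(\boldsymbol{x}_N) = 1. \tag{16.45}$$

Note that $\mu_{V_{n+1} \to S_n}(\boldsymbol{x}_n)$ involves $K$ values and for the computation of each one of them $K$ summations
are performed. So, the complexity scales as $O(K^2)$ per time instant. In the HMM literature, the symbol
"beta" is used,

$$\beta(\boldsymbol{x}_n) = \mu_{V_{n+1} \to S_n}(\boldsymbol{x}_n). \tag{16.46}$$

From the recursive definition in (16.44) and (16.45), where $\boldsymbol{x}_{n+1}, \boldsymbol{x}_{n+2}, \ldots, \boldsymbol{x}_N$ have been marginalized
out, we can equivalently write

$$\beta(\boldsymbol{x}_n) = p(\boldsymbol{y}_{n+1}, \boldsymbol{y}_{n+2}, \ldots, \boldsymbol{y}_N|\boldsymbol{x}_n). \tag{16.47}$$

That is, conditioned on the values of $\boldsymbol{x}_n$, for example, $\boldsymbol{x}_n: x_{n,k} = 1$, $\beta(\boldsymbol{x}_n)$ is the value of the joint
distribution for the observed values, $\boldsymbol{y}_{n+1}, \ldots, \boldsymbol{y}_N$, to be emitted when the system is at state $k$ at
time $n$.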
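
The backward recursion (16.44) and (16.45) admits an equally short sketch. As before, discrete emissions, 0-based indices, and the numerical values are assumptions of this example. A convenient sanity check, which follows from (16.42) and (16.47), is that $\sum_{\boldsymbol{x}_n} \alpha(\boldsymbol{x}_n)\beta(\boldsymbol{x}_n)$ equals the evidence $p(Y)$ for every $n$.

```python
def forward(P_init, P, B, obs):
    """alpha[n][k] as in (16.39) and (16.40); P[i][j] = prob. of j -> i."""
    K = len(P_init)
    alpha = [[P_init[k] * B[k][obs[0]] for k in range(K)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[i][y] * sum(P[i][j] * prev[j] for j in range(K))
                      for i in range(K)])
    return alpha


def backward(P, B, obs):
    """beta[n][j] = p(y_{n+1}, ..., y_N | state j at time n)."""
    K, N = len(P), len(obs)
    beta = [[1.0] * K for _ in range(N)]  # (16.45): the last message is one
    for n in range(N - 2, -1, -1):
        for j in range(K):
            # (16.44): sum over the K possible states at time n + 1.
            beta[n][j] = sum(beta[n + 1][i] * P[i][j] * B[i][obs[n + 1]]
                             for i in range(K))
    return beta


P_init = [0.5, 0.5]
P = [[0.9, 0.2], [0.1, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]
obs = [0, 1, 1, 0, 1]

alpha = forward(P_init, P, B, obs)
beta = backward(P, B, obs)
# The same evidence p(Y) should be recovered at every time instant.
evidence_per_n = [sum(a * b for a, b in zip(alpha[n], beta[n]))
                  for n in range(len(obs))]
print(evidence_per_n)
```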

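We have now all the "ingredients" in order to compute marginals. From (16.7), we obtain (explain
it based on the independence properties that underlie an HMM)

$$p(\boldsymbol{x}_n, \boldsymbol{y}_1, \boldsymbol{y}_2, \ldots, \boldsymbol{y}_N) = \mu_{V_n \to S_n}(\boldsymbol{x}_n)\,\mu_{V_{n+1} \to S_n}(\boldsymbol{x}_n) = \alpha(\boldsymbol{x}_n)\,\beta(\boldsymbol{x}_n), \tag{16.48}$$

which in turn leads to

$$\gamma(\boldsymbol{x}_n) := P(\boldsymbol{x}_n|Y) = \frac{\alpha(\boldsymbol{x}_n)\,\beta(\boldsymbol{x}_n)}{p(Y)}: \quad \text{smoothing recursion}. \tag{16.49}$$

This part of the recursion is known as the smoothing recursion. Note that in this computation, both past
(via $\alpha(\boldsymbol{x}_n)$) and future (via $\beta(\boldsymbol{x}_n)$) data are involved.

An alternative way to obtain $\gamma(\boldsymbol{x}_n)$ is via its own recursion together with $\alpha(\boldsymbol{x}_n)$, by avoiding $\beta(\boldsymbol{x}_n)$
(Problem 16.16). In such a scenario, both passing messages are related to densities with regard to $\boldsymbol{x}_n$,
which has certain advantages for the case of linear dynamic systems.

Finally, from (16.6) and recalling (16.38), (16.41), and (16.46), we obtain

$$p(\boldsymbol{x}_{n-1}, \boldsymbol{x}_n, Y) = P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n)\,\mu_{S_n \to V_n}(\boldsymbol{x}_n)\,\mu_{S_{n-1} \to V_n}(\boldsymbol{x}_{n-1}) = \alpha(\boldsymbol{x}_{n-1})\,P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n)\,\beta(\boldsymbol{x}_n), \tag{16.50}$$

or

$$P(\boldsymbol{x}_{n-1}, \boldsymbol{x}_n|Y) = \frac{\alpha(\boldsymbol{x}_{n-1})\,P(\boldsymbol{x}_n|\boldsymbol{x}_{n-1})\,p(\boldsymbol{y}_n|\boldsymbol{x}_n)\,\beta(\boldsymbol{x}_n)}{p(Y)} := \xi(\boldsymbol{x}_{n-1}, \boldsymbol{x}_n). \tag{16.51}$$

Thus, $\xi(\cdot, \cdot)$ is a table of $K^2$ probability values. Let $\xi(x_{n-1,j}, x_{n,i})$ correspond to $x_{n-1,j} = x_{n,i} = 1$.
Then $\xi(x_{n-1,j}, x_{n,i})$ is the probability of the system being at states $j$ and $i$ at times $n-1$ and $n$,
respectively, conditioned on the transmitted sequence of observations.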
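
A minimal Python sketch of (16.49) and (16.51) follows, again assuming discrete emissions, 0-based indices, and a hypothetical two-state model. The array layout `xi[n - 1][j][i]` is a choice made for this example. Two consistency checks are natural: summing $\xi$ over the state at time $n$ must return $\gamma$ at time $n-1$, and summing over the state at time $n-1$ must return $\gamma$ at time $n$.

```python
def forward(P_init, P, B, obs):
    """alpha[n][k]; P[i][j] is the probability of a transition j -> i."""
    K = len(P_init)
    alpha = [[P_init[k] * B[k][obs[0]] for k in range(K)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[i][y] * sum(P[i][j] * prev[j] for j in range(K))
                      for i in range(K)])
    return alpha


def backward(P, B, obs):
    """beta[n][j], initialized with ones at the last time instant."""
    K, N = len(P), len(obs)
    beta = [[1.0] * K for _ in range(N)]
    for n in range(N - 2, -1, -1):
        for j in range(K):
            beta[n][j] = sum(beta[n + 1][i] * P[i][j] * B[i][obs[n + 1]]
                             for i in range(K))
    return beta


def posteriors(P_init, P, B, obs):
    """Return gamma[n][k] of (16.49) and xi[n-1][j][i] of (16.51)."""
    K, N = len(P_init), len(obs)
    alpha, beta = forward(P_init, P, B, obs), backward(P, B, obs)
    evidence = sum(alpha[-1])
    gamma = [[alpha[n][k] * beta[n][k] / evidence for k in range(K)]
             for n in range(N)]
    # xi[n-1][j][i]: state j at (0-based) time n-1 and state i at time n.
    xi = [[[alpha[n - 1][j] * P[i][j] * B[i][obs[n]] * beta[n][i] / evidence
            for i in range(K)] for j in range(K)] for n in range(1, N)]
    return gamma, xi


P_init = [0.5, 0.5]
P = [[0.9, 0.2], [0.1, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]
obs = [0, 1, 1, 0, 1]
gamma, xi = posteriors(P_init, P, B, obs)
print(gamma[0], xi[0])
```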

In Section 15.7.4 a message passing scheme was proposed for the efficient computation of the maximum
of the joint distribution. This can also be applied in the junction tree associated with an HMM.
The resulting algorithm is known as the Viterbi algorithm. The Viterbi algorithm results in a straightforward
way from the general max-sum algorithm. The algorithm is similar to the one derived before;
all one has to do is to replace summations with the maximum operations. As we have already commented,
while discussing the max-product rule, computing the sequence of the complete set $(\boldsymbol{y}_n, \boldsymbol{x}_n)$,
$n = 1, 2, \ldots, N$, that maximizes the joint probability, using back-tracking, equivalently defines the optimal
trajectory in the two-dimensional grid.
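
The following Python sketch shows one common way to organize the Viterbi computation: the summations of the forward recursion are replaced by maximizations, logarithms are used so that products become sums, and back-pointers are stored for the back-tracking step. Discrete emissions, 0-based indices, the helper names, and the numbers are assumptions of this example; the brute-force search is there only as a check.

```python
import math
from itertools import product


def safe_log(p):
    """Natural logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def log_joint(P_init, P, B, path, obs):
    """ln p(Y, X) for a given path; P[i][j] is the probability of j -> i."""
    value = safe_log(P_init[path[0]]) + safe_log(B[path[0]][obs[0]])
    for n in range(1, len(obs)):
        value += safe_log(P[path[n]][path[n - 1]]) + safe_log(B[path[n]][obs[n]])
    return value


def viterbi(P_init, P, B, obs):
    """Return the most probable state path and its log joint probability."""
    K = len(P_init)
    score = [safe_log(P_init[k]) + safe_log(B[k][obs[0]]) for k in range(K)]
    back = []  # back[n-1][i]: best predecessor of state i at time n
    for y in obs[1:]:
        new_score, pointers = [], []
        for i in range(K):
            cands = [score[j] + safe_log(P[i][j]) for j in range(K)]
            j_best = max(range(K), key=cands.__getitem__)
            pointers.append(j_best)
            new_score.append(cands[j_best] + safe_log(B[i][y]))
        score = new_score
        back.append(pointers)
    last = max(range(K), key=score.__getitem__)
    path = [last]
    for pointers in reversed(back):  # back-tracking
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


P_init = [0.5, 0.5]
P = [[0.9, 0.2], [0.1, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]
obs = [0, 0, 1, 1, 1, 0]

best_path, best_logp = viterbi(P_init, P, B, obs)
brute_logp = max(log_joint(P_init, P, B, p, obs)
                 for p in product(range(2), repeat=len(obs)))
print(best_path, best_logp)
```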

Another inference task that is of interest in practice, besides recognition, is prediction, that is, given
an HMM and the observation sequence $\boldsymbol{y}_n$, $n = 1, 2, \ldots, N$, to optimally predict the value $\boldsymbol{y}_{N+1}$. This
can also be performed efficiently by appropriate marginalization (Problem 16.17).
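
One possible route to the marginalization mentioned above is sketched below; it is offered as an assumption about how the computation can be organized, not as the intended solution of Problem 16.17. It uses only the conditional independencies of the HMM (given $\boldsymbol{x}_{N+1}$, the new observation is independent of $Y$, and given $\boldsymbol{x}_N$, the new state is independent of $Y$) together with the filtered posterior $\alpha(\boldsymbol{x}_N)/p(Y)$.

```latex
\begin{align*}
p(\boldsymbol{y}_{N+1} \mid Y)
  &= \sum_{\boldsymbol{x}_{N+1}} p(\boldsymbol{y}_{N+1} \mid \boldsymbol{x}_{N+1})\, P(\boldsymbol{x}_{N+1} \mid Y), \\
P(\boldsymbol{x}_{N+1} \mid Y)
  &= \sum_{\boldsymbol{x}_{N}} P(\boldsymbol{x}_{N+1} \mid \boldsymbol{x}_{N})\, \frac{\alpha(\boldsymbol{x}_{N})}{p(Y)},
  \qquad p(Y) = \sum_{\boldsymbol{x}_{N}} \alpha(\boldsymbol{x}_{N}).
\end{align*}
```

Under this reading, prediction requires only the final forward message $\alpha(\boldsymbol{x}_N)$ and one additional predictor step.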


16.5.2 LEARNING THE PARAMETERS IN AN HMM 


This is the second time we refer to the learning of graphical models. The first time was at the end
of Section 16.3.2. The most natural way to obtain the unknown parameters is to maximize the likelihood/evidence
of the joint probability distribution. Because our task involves both observed and latent
variables, the EM algorithm is the first one that comes to mind. However, the underlying independencies
in an HMM will be employed in order to come up with an efficient learning scheme. The set of the
unknown parameters, $\Theta$, involves (a) the initial state probabilities, $P_k$, $k = 1, \ldots, K$, (b) the transition
probabilities, $P_{ij}$, $i, j = 1, 2, \ldots, K$, and (c) the parameters in the probability distributions associated
with the observations, $\boldsymbol{\theta}_k$, $k = 1, 2, \ldots, K$.

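Expectation step: From the general scheme presented in Section 12.4.1 (with $Y$ in place of $\mathcal{X}$, $X$ in
place of $\mathcal{X}^l$, and $\Theta$ in place of $\boldsymbol{\xi}$) at the $(t+1)$th iteration, we have to compute

$$Q(\Theta, \Theta^{(t)}) = \mathbb{E}\left[\ln p(Y, X; \Theta)\right],$$

where $\mathbb{E}[\cdot]$ is the expectation with respect to $P(X|Y; \Theta^{(t)})$. From (16.32)–(16.35) we obtain

$$\ln p(Y, X; \Theta) = \sum_{k=1}^{K} x_{1,k}\big(\ln P_k + \ln p(\boldsymbol{y}_1|k; \boldsymbol{\theta}_k)\big) + \sum_{n=2}^{N}\sum_{i=1}^{K}\sum_{j=1}^{K} x_{n-1,j}\,x_{n,i} \ln P_{ij} + \sum_{n=2}^{N}\sum_{k=1}^{K} x_{n,k} \ln p(\boldsymbol{y}_n|k; \boldsymbol{\theta}_k);$$

thus,

$$Q(\Theta, \Theta^{(t)}) = \sum_{k=1}^{K} \mathbb{E}[x_{1,k}] \ln P_k + \sum_{n=2}^{N}\sum_{i=1}^{K}\sum_{j=1}^{K} \mathbb{E}[x_{n-1,j}\,x_{n,i}] \ln P_{ij} + \sum_{n=1}^{N}\sum_{k=1}^{K} \mathbb{E}[x_{n,k}] \ln p(\boldsymbol{y}_n|k; \boldsymbol{\theta}_k). \tag{16.52}$$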


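Let us now recall (16.49) to obtain

$$\mathbb{E}[x_{n,k}] = \sum_{\boldsymbol{x}_n} P(\boldsymbol{x}_n|Y; \Theta^{(t)})\,x_{n,k} = \sum_{\boldsymbol{x}_n} \gamma(\boldsymbol{x}_n; \Theta^{(t)})\,x_{n,k}.$$

Note that $x_{n,k}$ can either be zero or one; hence, its mean value will be equal to the probability that $\boldsymbol{x}_n$
has the $k$th element $x_{n,k} = 1$ and we denote it as

$$\mathbb{E}[x_{n,k}] = \gamma(x_{n,k} = 1; \Theta^{(t)}). \tag{16.53}$$

Recall that given $\Theta^{(t)}$, $\gamma(\cdot; \Theta^{(t)})$ can be efficiently computed via the sum-product algorithm described
before. In a similar spirit and mobilizing the definition in (16.51), we can write

$$\mathbb{E}[x_{n-1,j}\,x_{n,i}] = \sum_{\boldsymbol{x}_n}\sum_{\boldsymbol{x}_{n-1}} P(\boldsymbol{x}_n, \boldsymbol{x}_{n-1}|Y; \Theta^{(t)})\,x_{n-1,j}\,x_{n,i} = \sum_{\boldsymbol{x}_n}\sum_{\boldsymbol{x}_{n-1}} \xi(\boldsymbol{x}_n, \boldsymbol{x}_{n-1}; \Theta^{(t)})\,x_{n-1,j}\,x_{n,i} = \xi(x_{n-1,j} = 1, x_{n,i} = 1; \Theta^{(t)}). \tag{16.54}$$

Note that $\xi(\cdot, \cdot; \Theta^{(t)})$ can also be efficiently computed as a by-product of the sum-product algorithm,
given $\Theta^{(t)}$. Thus, we can summarize the E-step as

$$Q(\Theta, \Theta^{(t)}) = \sum_{k=1}^{K} \gamma(x_{1,k} = 1; \Theta^{(t)}) \ln P_k + \sum_{n=2}^{N}\sum_{i=1}^{K}\sum_{j=1}^{K} \xi(x_{n-1,j} = 1, x_{n,i} = 1; \Theta^{(t)}) \ln P_{ij} + \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(x_{n,k} = 1; \Theta^{(t)}) \ln p(\boldsymbol{y}_n|k; \boldsymbol{\theta}_k). \tag{16.55}$$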

Maximization step: In this step, it suffices to obtain the derivatives/gradients with regard to $P_k$, $P_{ij}$,
and $\boldsymbol{\theta}_k$ and equate them to zero in order to obtain the new estimates, which will comprise $\Theta^{(t+1)}$. Note
that $P_k$ and $P_{ij}$ are probabilities; hence, their maximization should be constrained so that

$$\sum_{k=1}^{K} P_k = 1 \quad \text{and} \quad \sum_{i=1}^{K} P_{ij} = 1, \quad j = 1, 2, \ldots, K.$$

The resulting reestimation formulas are (Problem 16.18)

$$P_k^{(t+1)} = \frac{\gamma(x_{1,k} = 1; \Theta^{(t)})}{\sum_{i=1}^{K} \gamma(x_{1,i} = 1; \Theta^{(t)})}, \tag{16.56}$$

$$P_{ij}^{(t+1)} = \frac{\sum_{n=2}^{N} \xi(x_{n-1,j} = 1, x_{n,i} = 1; \Theta^{(t)})}{\sum_{n=2}^{N}\sum_{k=1}^{K} \xi(x_{n-1,j} = 1, x_{n,k} = 1; \Theta^{(t)})}. \tag{16.57}$$

The reestimation of $\boldsymbol{\theta}_k$ depends on the form of the corresponding distribution $p(\boldsymbol{y}_n|k; \boldsymbol{\theta}_k)$. For example,
in the Gaussian scenario, the parameters are the mean values and the elements of the covariance matrix.
In this case, we obtain exactly the same iterations as those resulting for the problem of Gaussian
mixtures (see Eqs. (12.60) and (12.61)), if in place of the posterior we use $\gamma$.

In summary, training an HMM comprises the following steps: 

1. Initialize the parameters in $\Theta$.

2. Run the sum-product algorithm to obtain $\gamma(\cdot)$ and $\xi(\cdot, \cdot)$, using the current set of parameter estimates.

3. Update the parameters as in (16.56) and (16.57).

Iterations in steps 2 and 3 continue until a convergence criterion is met, as in EM. This iterative
scheme is also known as the Baum-Welch or forward-backward algorithm. Besides the forward-backward
algorithm for training HMMs, the literature is rich in a number of alternatives with the
goal of either simplifying computations or improving performance. For example, a simpler training algorithm
can be derived tailored to the Viterbi scheme for computing the optimum path (e.g., [63,72]).
Also, to further simplify the training algorithm, we can assume that our state observation variables, $y_n$,
are discretized (quantized) and can take values from a finite set of $L$ possible ones, $\{1, 2, \ldots, L\}$. This
is often the case in practice. Furthermore, assume that the first state is also known. This is, for example,
the case for left-to-right models like the one shown in Fig. 16.17. In such a case, we need not compute
estimates of the initial probabilities. Thus, the unknown parameters to be estimated are the transition
probabilities and the probabilities $P_y(r|i)$, $r = 1, 2, \ldots, L$, $i = 1, 2, \ldots, K$, that is, the probability of
emitting symbol $r$ from state $i$.
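
The forward-backward training steps summarized above can be sketched as follows for the discrete-observation case. This is a minimal illustration under stated assumptions: a single training sequence, 0-based symbols, no scaling, and an emission update (posterior-weighted symbol counts) that is the usual choice for discrete emissions but is not spelled out in the text. The initial parameter values and the name `em_step` are hypothetical.

```python
import math


def forward(P_init, P, B, obs):
    K = len(P_init)
    alpha = [[P_init[k] * B[k][obs[0]] for k in range(K)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[i][y] * sum(P[i][j] * prev[j] for j in range(K))
                      for i in range(K)])
    return alpha


def backward(P, B, obs):
    K, N = len(P), len(obs)
    beta = [[1.0] * K for _ in range(N)]
    for n in range(N - 2, -1, -1):
        for j in range(K):
            beta[n][j] = sum(beta[n + 1][i] * P[i][j] * B[i][obs[n + 1]]
                             for i in range(K))
    return beta


def em_step(P_init, P, B, obs):
    """One Baum-Welch iteration; also returns ln p(Y) under the input model."""
    K, N, L = len(P_init), len(obs), len(B[0])
    alpha, beta = forward(P_init, P, B, obs), backward(P, B, obs)
    evidence = sum(alpha[-1])
    gamma = [[alpha[n][k] * beta[n][k] / evidence for k in range(K)]
             for n in range(N)]
    # xi_sum[i][j]: sum over n = 2..N of xi(x_{n-1,j} = 1, x_{n,i} = 1).
    xi_sum = [[sum(alpha[n - 1][j] * P[i][j] * B[i][obs[n]] * beta[n][i]
                   for n in range(1, N)) / evidence
               for j in range(K)] for i in range(K)]
    new_init = gamma[0][:]                       # (16.56); gamma[0] sums to one
    col = [sum(xi_sum[i][j] for i in range(K)) for j in range(K)]
    new_P = [[xi_sum[i][j] / col[j] for j in range(K)] for i in range(K)]  # (16.57)
    # Assumed discrete-emission update: posterior-weighted symbol frequencies.
    occ = [sum(gamma[n][k] for n in range(N)) for k in range(K)]
    new_B = [[sum(gamma[n][k] for n in range(N) if obs[n] == r) / occ[k]
              for r in range(L)] for k in range(K)]
    return new_init, new_P, new_B, math.log(evidence)


# Hypothetical initialization and training sequence.
params = ([0.6, 0.4], [[0.7, 0.4], [0.3, 0.6]], [[0.6, 0.4], [0.3, 0.7]])
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
logliks = []
for _ in range(10):
    *params, ll = em_step(*params, obs)
    logliks.append(ll)
print(logliks)
```

Since each iteration is an exact EM step, the recorded log-likelihood values should not decrease from one iteration to the next; for long sequences, the scaling discussed in Remarks 16.4 would be needed.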

Viterbi reestimation: The goal of the algorithm is to obtain the best path and compute the associated 
cost, say, D , along the path. In the speech literature, the algorithm is also known as the segmental 
k-means training algorithm [63]. 

Definitions: 

• $n_{i|j}$ := number of transitions from state $j$ to state $i$.

• $n_{\cdot|j}$ := number of transitions originated from state $j$.

• $n_{i|\cdot}$ := number of transitions terminated at state $i$.

• $n(r|i)$ := number of times observation $r \in \{1, 2, \ldots, L\}$ occurs jointly with state $i$.

Iterations:

• Initial conditions: Assume the initial estimates of the unknown parameters.

• Step 1: From the available best path, reestimate the new model parameters as

$$P^{(\text{new})}(i|j) = \frac{n_{i|j}}{n_{\cdot|j}},$$

$$P_y^{(\text{new})}(r|i) = \frac{n(r|i)}{n_{i|\cdot}}.$$

• Step 2: For the new model parameters, obtain the best path and compute the corresponding
overall cost $D^{(\text{new})}$. Compare it with the cost $D$ of the previous iteration. If $D^{(\text{new})} - D > \epsilon$,
set $D = D^{(\text{new})}$ and go to step 1. Otherwise stop.

The Viterbi reestimation algorithm can be shown to converge to a proper characterization of the under- 
lying observations [14]. 
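
Step 1 of the Viterbi reestimation amounts to counting along the current best path, as the short Python sketch below illustrates. The path, the observations, and the function name are hypothetical. One detail is an assumption of this example: the emission counts are normalized by the number of visits to each state, so that every emission row sums to one even for the state occupied at the first time instant, which is not the termination of any transition.

```python
def reestimate_from_path(path, obs, K, L):
    """Count-based reestimation from a single best path (0-based indices).

    Returns P_new[i][j], the estimated probability of a jump from j to i,
    and B_new[k][r], the estimated probability of symbol r in state k.
    """
    n_trans = [[0] * K for _ in range(K)]   # n_trans[i][j]: transitions j -> i
    n_emit = [[0] * L for _ in range(K)]    # n_emit[k][r]: symbol r seen in state k
    for j, i in zip(path, path[1:]):
        n_trans[i][j] += 1
    for k, r in zip(path, obs):
        n_emit[k][r] += 1

    P_new = [[0.0] * K for _ in range(K)]
    for j in range(K):
        out_j = sum(n_trans[i][j] for i in range(K))  # transitions leaving j
        for i in range(K):
            # A state that is never left keeps an all-zero column here.
            P_new[i][j] = n_trans[i][j] / out_j if out_j else 0.0

    B_new = [[0.0] * L for _ in range(K)]
    for k in range(K):
        visits = sum(n_emit[k])  # assumed normalizer: visits to state k
        for r in range(L):
            B_new[k][r] = n_emit[k][r] / visits if visits else 0.0
    return P_new, B_new


# Hypothetical best path of a three-state left-to-right model.
path = [0, 0, 1, 1, 1, 2]
obs = [0, 0, 1, 1, 0, 2]
P_new, B_new = reestimate_from_path(path, obs, K=3, L=3)
print(P_new)
print(B_new)
```

In a full implementation, this step would alternate with a new best-path search (Step 2) until the cost stops improving by more than $\epsilon$.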
Remarks 16.4.

• Scaling: The probabilities $\alpha$ and $\beta$, being less than one, can take very small values as iterations
progress. In practice, the dynamic range of their computed values may exceed that of the computer.
This phenomenon can be efficiently dealt with via an appropriate scaling. If this is done properly
on both $\alpha$ and $\beta$, then the effect of scaling cancels out [63].

• Insufficient training data set: Generally, a large amount of training data is necessary to learn the
HMM parameters. The observation sequence must be sufficiently long with respect to the number
of states of the HMM model. This will guarantee that all state transitions will appear a sufficient
number of times, so that the reestimation algorithm learns their respective parameters. If this is not
the case, a number of techniques have been devised to cope with the issue. For a more detailed
treatment, the reader may consult [8,63] and the references therein.
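
To illustrate the scaling remark, the sketch below normalizes the forward message at every time instant and accumulates the logarithms of the normalizing constants, whose sum equals $\ln p(Y)$. This is one common scaling scheme, assumed here for illustration; it is not necessarily the one described in [63]. The model values and the artificial observation sequence are hypothetical.

```python
import math


def unscaled_evidence(P_init, P, B, obs):
    """Plain forward recursion; underflows to 0.0 for long sequences."""
    K = len(P_init)
    alpha = [P_init[k] * B[k][obs[0]] for k in range(K)]
    for y in obs[1:]:
        alpha = [B[i][y] * sum(P[i][j] * alpha[j] for j in range(K))
                 for i in range(K)]
    return sum(alpha)


def scaled_forward_loglik(P_init, P, B, obs):
    """Forward recursion with per-step normalization; returns ln p(Y)."""
    K = len(P_init)
    alpha = [P_init[k] * B[k][obs[0]] for k in range(K)]
    loglik = 0.0
    for n in range(len(obs)):
        if n > 0:
            alpha = [B[i][obs[n]] * sum(P[i][j] * alpha[j] for j in range(K))
                     for i in range(K)]
        c = sum(alpha)                    # scaling constant at time n
        alpha = [a / c for a in alpha]    # normalized message sums to one
        loglik += math.log(c)
    return loglik


P_init = [0.5, 0.5]
P = [[0.9, 0.2], [0.1, 0.8]]   # P[i][j]: probability of a transition j -> i
B = [[0.7, 0.3], [0.4, 0.6]]

short_obs = [0, 1, 1, 0]
long_obs = [n % 2 for n in range(5000)]   # artificial long sequence

loglik_short = scaled_forward_loglik(P_init, P, B, short_obs)
loglik_long = scaled_forward_loglik(P_init, P, B, long_obs)
print(loglik_short, math.log(unscaled_evidence(P_init, P, B, short_obs)))
print(loglik_long, unscaled_evidence(P_init, P, B, long_obs))
```

For the short sequence the two computations agree, while for the long one the unscaled evidence underflows to zero in double precision and only the scaled version returns a finite log-likelihood.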

16.5.3 DISCRIMINATIVE LEARNING 

Discriminative learning is another path that has attracted a lot of attention. Note that the EM algorithm 
optimizes the likelihood with respect to the unknown parameters of a single HMM in “isolation”; that 
is, without considering the rest of the HMMs, which model the other words (in the case of speech 
recognition) or other templates/prototypes that are stored in the database. Such an approach is in line 
with what we defined as generative learning in Chapter 3. In contrast, the essence of discriminative 
learning is to optimize the set of parameters so that the models become optimally discriminated over 
the training sets (e.g., in terms of the error probability criterion). In other words, the parameters de- 
scribing the different statistical models (HMMs) are optimized in a combined way, not individually. 
The goal is to make the different HMM models as distinct as possible, according to a criterion. This has
been an intense line of research and a number of techniques have been developed around criteria that 
lead to either convex or nonconvex optimization methods (see, for example, [33] and the references 
therein). 

Remarks 16.5. 

• Besides the basic HMM scheme, which was described in this section, a number of variants have 
been proposed in order to overcome some of its shortcomings. For example, alternative modeling 
paths concern the first-order Markov property and propose models to extend correlations to longer 
times. 

In the autoregressive HMM [11], links are added among the observation nodes of the basic HMM
scheme in Fig. 16.15; for example, $\boldsymbol{y}_n$ is not only linked to $\boldsymbol{x}_n$, but it shares direct links with, for
example, $\boldsymbol{y}_{n-2}$, $\boldsymbol{y}_{n-1}$, $\boldsymbol{y}_{n+1}$, and $\boldsymbol{y}_{n+2}$, if the model extends correlations up to two time instants away.
A different concept has been introduced in [56] in the context of segment modeling. According
to this model, each state is allowed to emit, say, $d$ successive observations, which comprise a
segment. The length of the segment, $d$, is itself a random variable and it is associated with a probability
$P(d|k)$, $k = 1, 2, \ldots, K$. In this way, correlation is introduced via the joint distribution of the
samples comprising the segment.

• Variable duration HMM: A serious shortcoming of the HMMs, which is often observed in practice,
is associated with the self-transition probabilities, $P(k|k)$, which are among the model parameters
associated with an HMM. Note that the probability of the model being at state $k$ for $d$ successive
instants (initial transition to the state and $d - 1$ self-transitions) is given by

$$P_k(d) = \big(P(k|k)\big)^{d-1}\big(1 - P(k|k)\big),$$

where $1 - P(k|k)$ is the probability of leaving the state. For many cases, this exponential state duration
dependence is not realistic. In variable-duration HMMs, $P_k(d)$ is explicitly modeled. Different
models for $P_k(d)$ can be employed (see, e.g., [46,68,72]).

FIGURE 16.20

A factorial HMM with three chains of hidden variables.

• Hidden Markov modeling is among the most powerful tools in machine learning and has been
widely used in a large number of applications besides speech recognition. Some sample references
are [9] in bioinformatics, [16,36] in communications, [4,73] in optical character recognition (OCR),
and [40,61,62] in music analysis/recognition, to name but a few. For a further discussion on HMMs,
see, for example, [8,64,72].
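
The state duration formula in the variable-duration remark above is a geometric distribution, which the following short Python check makes explicit. The self-transition value 0.8 is a hypothetical choice; the infinite sums are truncated at a point where the remaining terms are negligible.

```python
def duration_pmf(p_self, d):
    """Probability of staying in a state for exactly d successive instants."""
    return p_self ** (d - 1) * (1.0 - p_self)


p_self = 0.8  # hypothetical self-transition probability P(k|k)
total = sum(duration_pmf(p_self, d) for d in range(1, 2000))
mean = sum(d * duration_pmf(p_self, d) for d in range(1, 2000))
print(total, mean)  # the mean duration is 1 / (1 - P(k|k))
```

The most probable duration is always $d = 1$ and the probabilities decay exponentially with $d$, which is the behavior that variable-duration models are designed to avoid.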


16.6 BEYOND HMMS: A DISCUSSION 

In this section, some notable extensions of the hidden Markov models, which were previously discussed,
are considered in order to meet requirements of applications where either the number of states
is large or the homogeneity assumption is no longer justified.

16.6.1 FACTORIAL HIDDEN MARKOV MODELS 

In the HMMs considered before, the system dynamics is described via the hidden variables, whose
graphical representation is a chain. However, such a model may turn out to be too simple for certain
applications. A variant of the HMM involves $M$ chains, instead of one chain, where each chain of
hidden variables unfolds in time independently of the others. Thus at time $n$, $M$ hidden variables are
involved, denoted as $\boldsymbol{x}_n^{(m)}$, $m = 1, 2, \ldots, M$ [17,34,81]. The observations occur as a combined emission
where all hidden variables are involved. The respective graphical structure is shown in Fig. 16.20 for
$M = 3$. Each one of the chains develops on its own, as the graphical model suggests. Such models are
known as factorial HMMs (FHMMs). One obvious question is, why not use a single chain of hidden
variables by increasing the number of possible states? It turns out that such a naive approach would
blow up complexity. Take as an example the case of $M = 3$, where for each one of the hidden variables,
the number of states is equal to 10. The table of transition probabilities for each chain requires $10^2$
entries, which amounts to a total number of 300, i.e., $P_{ij}^{(m)}$, $i, j = 1, 2, \ldots, 10$, $m = 1, 2, 3$. Moreover,
the total number of state combinations which can be realized is $10^3 = 1000$. To implement the same
number of states via a single chain one would need a table of transition probabilities of size $(10^3)^2 = 10^6$!
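
The counting argument can be reproduced with a few lines of Python; the function name is a hypothetical choice for this example.

```python
def transition_table_sizes(M, K):
    """Number of transition entries for M chains with K states each."""
    factorial_entries = M * K ** 2        # one K x K table per chain
    joint_states = K ** M                 # combinations of the M chain states
    single_chain_entries = joint_states ** 2
    return factorial_entries, joint_states, single_chain_entries


print(transition_table_sizes(M=3, K=10))  # (300, 1000, 1000000)
```

The factorial representation grows linearly with the number of chains $M$, whereas the equivalent single chain grows exponentially.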

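Let $\mathcal{X}_n$ be the $M$-tuple $(\boldsymbol{x}_n^{(1)}, \ldots, \boldsymbol{x}_n^{(M)})$, where each $\boldsymbol{x}_n^{(m)}$ has only one of its elements equal to
1 (indicating a state) and the rest are zero. Then

$$P(\mathcal{X}_n|\mathcal{X}_{n-1}) = \prod_{m=1}^{M} P^{(m)}\big(\boldsymbol{x}_n^{(m)}|\boldsymbol{x}_{n-1}^{(m)}\big).$$

In [17], the Gaussian distribution was employed for the observations, that is,

$$p(\boldsymbol{y}_n|\mathcal{X}_n) = \mathcal{N}\left(\boldsymbol{y}_n \,\Big|\, \sum_{m=1}^{M} M^{(m)}\boldsymbol{x}_n^{(m)}, \Sigma\right), \tag{16.58}$$

where

$$M^{(m)} = \left[\boldsymbol{\mu}_1^{(m)}, \ldots, \boldsymbol{\mu}_K^{(m)}\right], \quad m = 1, 2, \ldots, M, \tag{16.59}$$

are the matrices comprising the mean vectors associated with each state and the covariance matrix is
assumed to be known and the same for all. The joint probability distribution is given by

$$P(\mathcal{X}_1, \ldots, \mathcal{X}_N, Y) = \prod_{m=1}^{M}\left(P^{(m)}\big(\boldsymbol{x}_1^{(m)}\big)\prod_{n=2}^{N} P^{(m)}\big(\boldsymbol{x}_n^{(m)}|\boldsymbol{x}_{n-1}^{(m)}\big)\right)\prod_{n=1}^{N} p(\boldsymbol{y}_n|\mathcal{X}_n). \tag{16.60}$$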


The challenging task in factorial HMMs is complexity. This is illustrated in Fig. 16.21, where the
explosion in the size of cliques after performing the moralization and triangulation steps is readily
deduced.

In [17], the variational approximation method is adopted to simplify the structure. However, in
contrast to the complete factorization scheme, which was adopted for the approximating distribution,
$Q$, in Section 16.3.2 for the Boltzmann machine (corresponding to the removal of all edges in the
graph), here the approximating graph will have a more complex structure. Only the edges connected to
the output nodes are removed; this results in the graphical structure of Fig. 16.22, for $M = 3$. Because
this structure is tractable, there is no need for further simplifications. The approximate conditional
distribution, $Q$, of the simplified structure is parameterized in terms of a set of variational parameters,
FIGURE 16.21

The graph resulting from a factorial HMM with three chains of hidden variables, after the moralization (linking 
variables in the same time instant) and triangulation (linking variables between neighboring time instants) steps. 
FIGURE 16.22

The simplified graphical structure of an FHMM comprising three chains used in the framework of variational 
approximation. The nodes associated with the observed variables are delinked. 


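$\boldsymbol{\lambda}_n^{(m)}$ (one for each delinked node), and it is written as

$$Q(\mathcal{X}_1, \ldots, \mathcal{X}_N|Y; \boldsymbol{\lambda}) = \prod_{m=1}^{M}\left(\tilde{P}^{(m)}\big(\boldsymbol{x}_1^{(m)}\big)\prod_{n=2}^{N} \tilde{P}^{(m)}\big(\boldsymbol{x}_n^{(m)}|\boldsymbol{x}_{n-1}^{(m)}\big)\right), \tag{16.61}$$

where

$$\tilde{P}^{(m)}\big(\boldsymbol{x}_n^{(m)}|\boldsymbol{x}_{n-1}^{(m)}\big) = P^{(m)}\big(\boldsymbol{x}_n^{(m)}|\boldsymbol{x}_{n-1}^{(m)}\big)\boldsymbol{\lambda}_n^{(m)}, \quad m = 1, 2, \ldots, M, \ n = 2, \ldots, N,$$

and

$$\tilde{P}^{(m)}\big(\boldsymbol{x}_1^{(m)}\big) = P^{(m)}\big(\boldsymbol{x}_1^{(m)}\big)\boldsymbol{\lambda}_1^{(m)}.$$

The variational parameters are estimated by minimizing the KL distance between $Q$ and the conditional
distribution associated with (16.60). This compensates for some of the information loss caused by the
removal of the observation nodes. The optimization process renders the variational parameters interdependent;
this (deterministic) interdependence can be viewed as an approximation to the probabilistic
dependence imposed by the exact structure prior to the approximation.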

16.6.2 TIME-VARYING DYNAMIC BAYESIAN NETWORKS 

Hidden Markov as well as factorial hidden Markov models are homogeneous; hence, both the structure
and the parameters are fixed throughout time. However, such an assumption is not satisfactory for a
number of applications where the underlying relationships as well as the structural pattern of a system
undergo changes as time evolves. For example, the gene interactions do not remain the same throughout
life; the appearance of an object across multiple cameras is continuously changing. For systems that
are described by parameters whose values are varying slowly in an interval, we have already discussed
a number of alternatives in previous chapters. The theory of graphical models provides the tools to
study systems with a mixed set of parameters (discrete and continuous); also, graphical models lend
themselves to modeling of nonstationary environments, where step changes are also involved.

One path toward time-varying modeling is to consider graphical models of fixed structure but with
time-varying parameters, known as switching linear dynamic systems (SLDSs). Such models serve
the needs of systems in which a linear dynamic model jumps from one parameter setting to another;
hence, the latent variables are both of discrete and of continuous nature. At time instant $n$, a switch
discrete variable, $s_n \in \{1, 2, \ldots, M\}$, selects a single LDS from an available set of $M$ (sub)systems.
The dynamics of $s_n$ is also modeled to comply with the Markovian philosophy, and transitions from
one LDS to another are governed by $P(s_n|s_{n-1})$. This problem has a long history and its origins can be
traced back to the time just after the publication of the seminal paper by Kalman [35]; see, for example,
[18] and [2,3] for a more recent review of related techniques concerning the approximate inference task
in such networks.

Another path is to consider that both the structure as well as the parameters change over time. One 
route is to adopt a quasistationary rationale, and assume that the data sequence is piece-wise stationary 
in time (for example, [12,54,66]). Nonstationarity is conceived as a cascade of stationary models, which
have previously been learned on presegmented subintervals. The other route assumes that the structure
and parameters are continuously changing (for example, [42,79]). An example for the latter case is a 
Bayesian network where the parents of each node and the parameters, which define the conditional 
distributions, are time-varying. A separate variable is employed that defines the structure at each time 
instant; that is, the set of linking directed edges. The concept is illustrated in Fig. 16.23. The method 
has been applied to the task of active camera tracking [79]. 


16.7 LEARNING GRAPHICAL MODELS 

Learning a graphical model consists of two parts. Given a number of observations, one has to specify 
both the graphical structure and the associated parameters. 








FIGURE 16.23 

The figure corresponds to a time-varying dynamic Bayesian network with two variables $x_n^1$ and $x_n^2$, $n = 1, 2, \ldots$
The parameters controlling the conditional distributions are considered as separate nodes, one for each
of the two variables. The structure variable, $G_n$, controls the values of the parameters as well as the structure of the
network, which is continuously changing.


16.7.1 PARAMETER ESTIMATION 

Once a graphical model has been adopted, one has to estimate the unknown parameters. For example,
in a Bayesian network involving discrete variables one has to estimate the values of the conditional
probabilities. In Section 16.5, the case of learning the unknown parameters in the context of an HMM
was presented. The key point was to maximize the joint PDF over the observed output variables. This
is among the most popular criteria used for parameter estimation in different graphical structures. In
the HMM case, some of the variables were latent, and hence the EM algorithm was mobilized. If all the
variables of the graph can be observed, then the task of parameter learning becomes a typical maximum
likelihood one. More specifically, consider a network with $l$ nodes representing the variables $x_1, \ldots, x_l$,
which are compactly written as a random vector $\boldsymbol{x}$. Let also $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N$ be a set of observations;
then

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, p(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N; \boldsymbol{\theta}),$$

where $\boldsymbol{\theta}$ comprises all the parameters in the graph. If latent variables are involved, then one has to
marginalize them out. Any of the parameter estimation techniques that were discussed in Chapters 12
and 13 can be used. Moreover, one can take advantage of the special structure of the graph (i.e., the
underlying independencies) to simplify computations. In the HMM case, its Bayesian network structure
was exploited by bringing the sum-product algorithm into the game. Besides maximum likelihood, one
can adopt any other method related to parameter estimation/inference. For example, one can impose a
prior $p(\boldsymbol{\theta})$ on the unknown parameters and resort to a MAP estimation. Moreover, the full Bayesian
scenario can also be employed, in which the parameters are assumed to be random variables. Such a line
presupposes that the unknown parameters have been included as extra nodes to the network, linked
appropriately to those of the variables that they affect. As a matter of fact, this is what we did in
Fig. 13.2, although there, we had not talked about graphical models yet (see also Fig. 16.24). Note that
in this case, in order to perform any inference on the variables of the network one should marginalize
out the parameters. For example, assume that our $l$ variables correspond to the nodes of a Bayesian
network, where the local conditional distributions,

$$p(x_i|\text{Pa}_i; \boldsymbol{\theta}_i), \quad i = 1, 2, \ldots, l,$$

depend on the parameters $\boldsymbol{\theta}_i$. Also, assume that the (random) parameters $\boldsymbol{\theta}_i$, $i = 1, 2, \ldots, l$, are mutually
independent. Then the joint distribution over the variables is given by

$$p(x_1, x_2, \ldots, x_l) = \prod_{i=1}^{l} \int p(x_i|\text{Pa}_i; \boldsymbol{\theta}_i)\,p(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i.$$

Using convenient priors, that is, conjugate priors, computations can be significantly facilitated; we have 
demonstrated such examples in Chapters 12 and 13. 

Besides the previous techniques, which are offspring of the generative modeling of the underlying 
processes, discriminative techniques have also been developed. 



FIGURE 16.24 


An example of a Bayesian network, where new nodes associated with the parameters have been included in order to 
treat parameters as random variables, as required by the Bayesian parameter learning approach. 


FIGURE 16.25

The Bayesian network associated with the naive Bayes classifier. The joint PDF factorizes as
$p(y, x_1, \ldots, x_l) = p(y) \prod_{i=1}^{l} p(x_i | y)$.

In a general setting, let us consider a pattern recognition task where the (output) label variable
$y$ and the (input) feature variables $x_1, \ldots, x_l$ are jointly distributed according to a distribution
that can be factorized over a graph, which is parameterized in terms of a vector parameter $\boldsymbol{\theta}$, i.e.,
$p(y, x_1, x_2, \ldots, x_l; \boldsymbol{\theta}) := p(y, \boldsymbol{x}; \boldsymbol{\theta})$ [13]. A typical example of such modeling is the naive Bayes
classifier, which was discussed in Chapter 7, whose graphical representation is given in Fig. 16.25. For a
given set of training data, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, the log-likelihood function becomes

$$L(Y, X; \boldsymbol{\theta}) = \sum_{n=1}^{N} \ln p(y_n, \boldsymbol{x}_n; \boldsymbol{\theta}). \tag{16.62}$$



Estimating $\boldsymbol{\theta}$ by maximizing $L(\cdot, \cdot; \boldsymbol{\theta})$, one would obtain an estimate that guarantees the best (according
to the maximum likelihood criterion) fit of the corresponding distribution to the available training set.
However, our ultimate goal is not to model the generation “mechanism” of the data. Our ultimate goal
is to classify them correctly. Let us rewrite (16.62) as

$$L(Y, X; \boldsymbol{\theta}) = \sum_{n=1}^{N} \ln P(y_n | \boldsymbol{x}_n; \boldsymbol{\theta}) + \sum_{n=1}^{N} \ln p(\boldsymbol{x}_n; \boldsymbol{\theta}).$$

Getting biased toward the classification task, it is more sensible to obtain $\boldsymbol{\theta}$ by maximizing the first of
the two terms only, that is,

$$L_c(Y, X; \boldsymbol{\theta}) = \sum_{n=1}^{N} \ln P(y_n | \boldsymbol{x}_n; \boldsymbol{\theta}) = \sum_{n=1}^{N} \left( \ln p(y_n, \boldsymbol{x}_n; \boldsymbol{\theta}) - \ln \sum_{y_n} p(y_n, \boldsymbol{x}_n; \boldsymbol{\theta}) \right), \tag{16.63}$$

where the summation over $y_n$ is over all possible values of $y_n$ (classes). This is known as the
conditional log-likelihood (see, for example, [19,20,67]). The resulting estimate, $\hat{\boldsymbol{\theta}}$, guarantees that overall
the posterior class probabilities, given the feature values, are maximized over the training data set;
after all, Bayesian classification is based on selecting the class of $\boldsymbol{x}$ according to the maximum of the
posterior probability. However, one has to be careful. The price one pays for such approaches is that
the conditional log-likelihood is not decomposable, and more sophisticated optimization schemes have
to be mobilized. Moreover, maximizing the conditional log-likelihood does not guarantee that the error
probability is also minimized. This can only be guaranteed if one estimates $\boldsymbol{\theta}$ so as to minimize the empirical
error probability. However, such a criterion is hard to deal with, as it is not differentiable; attempts to
deal with it by using approximate smoothing functions or hill climbing greedy techniques have been
proposed (for example, [58] and the references therein). Note that the rationale behind the conditional
log-likelihood is closely related to that behind conditional random fields, discussed in Section 15.4.3.

Another route in discriminative learning is to obtain the estimate of $\boldsymbol{\theta}$ by maximizing the margin.
The probabilistic class margin (for example, [21,59]) is defined as

$$d_n = \min_{y \neq y_n} \frac{P(y_n | \boldsymbol{x}_n; \boldsymbol{\theta})}{P(y | \boldsymbol{x}_n; \boldsymbol{\theta})} = \frac{P(y_n | \boldsymbol{x}_n; \boldsymbol{\theta})}{\max_{y \neq y_n} P(y | \boldsymbol{x}_n; \boldsymbol{\theta})} = \frac{p(y_n, \boldsymbol{x}_n; \boldsymbol{\theta})}{\max_{y \neq y_n} p(y, \boldsymbol{x}_n; \boldsymbol{\theta})}.$$

The idea is to estimate $\boldsymbol{\theta}$ so as to maximize the minimum margin over all training data, that is,

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \min(d_1, d_2, \ldots, d_N).$$

The interested reader may also consult [10,55,60] for related reviews and methodologies.
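
To make the two discriminative criteria concrete, the following minimal sketch evaluates the conditional log-likelihood and the smallest probabilistic margin for a tiny naive Bayes model with fixed parameters. The function names, the two-class model, and all numerical values are illustrative assumptions, not taken from the text; no optimization over the parameters is attempted.

```python
import math

def joint(y, x, prior, cond):
    """p(y, x; theta) for a naive Bayes model: p(y) * prod_i p(x_i | y)."""
    p = prior[y]
    for i, xi in enumerate(x):
        p *= cond[y][i][xi]
    return p

def conditional_log_likelihood(data, prior, cond):
    """L_c = sum_n [ ln p(y_n, x_n) - ln sum_y p(y, x_n) ]."""
    total = 0.0
    for y_n, x_n in data:
        evidence = sum(joint(y, x_n, prior, cond) for y in prior)
        total += math.log(joint(y_n, x_n, prior, cond)) - math.log(evidence)
    return total

def min_margin(data, prior, cond):
    """min_n d_n, with d_n = p(y_n, x_n) / max_{y != y_n} p(y, x_n)."""
    margins = []
    for y_n, x_n in data:
        rival = max(joint(y, x_n, prior, cond) for y in prior if y != y_n)
        margins.append(joint(y_n, x_n, prior, cond) / rival)
    return min(margins)

# Hypothetical parameters: cond[y][i][v] = P(x_i = v | y), two binary features.
prior = {0: 0.5, 1: 0.5}
cond = {
    0: [{0: 0.8, 1: 0.2}, {0: 0.7, 1: 0.3}],
    1: [{0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}],
}
data = [(0, (0, 0)), (1, (1, 1)), (0, (0, 1))]

L_c = conditional_log_likelihood(data, prior, cond)
d_min = min_margin(data, prior, cond)
```

A margin larger than one for every training point means that each point is assigned to its true class by the maximum posterior rule; a discriminative learner would adjust `prior` and `cond` to increase either `L_c` or `d_min`.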

Example 16.4. The goal of this example is to obtain the values in the conditional probability table
in a general Bayesian network, which consists of $l$ discrete random nodes/variables, $x_1, x_2, \ldots, x_l$.
We assume that all the involved variables can be observed and we have a training set of $N$
observations. The maximum likelihood method will be employed. Let $x_i(n)$, $n = 1, 2, \ldots, N$, denote the $n$th
observation of the $i$th variable.

The joint PDF under the Bayesian network assumption is given by

$$P(x_1, \ldots, x_l) = \prod_{i=1}^{l} P(x_i | \mathrm{Pa}_i; \boldsymbol{\theta}_i),$$

and the respective log-likelihood is

$$L(X; \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{i=1}^{l} \ln P\big(x_i(n) | \mathrm{Pa}_i(n); \boldsymbol{\theta}_i\big).$$

Assuming $\boldsymbol{\theta}_i$ to be disjoint with $\boldsymbol{\theta}_j$, $i \neq j$, optimization over each $\boldsymbol{\theta}_i$, $i = 1, 2, \ldots, l$, can take place
separately. This property is referred to as the global decomposition of the likelihood function. Thus,
it suffices to perform the optimization locally on each node, that is,

$$l(\boldsymbol{\theta}_i) = \sum_{n=1}^{N} \ln P\big(x_i(n) | \mathrm{Pa}_i(n); \boldsymbol{\theta}_i\big), \quad i = 1, 2, \ldots, l. \tag{16.64}$$


Let us now focus on the case where all the involved variables are discrete, and the unknown quantities
at any node $i$ are the values of the conditional probabilities in the respective conditional probability
table. For notational convenience, denote as $\boldsymbol{h}_i$ the vector comprising the state indices of the parent
variables of $x_i$. Then the respective (unknown) probabilities are denoted as $P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i)$, for all
possible combinations of values of $x_i$ and $\boldsymbol{h}_i$. For example, if all the involved variables are binary and
$x_i$ has two parent nodes, then $P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i)$ can take a total of eight values that have to be estimated.
Eq. (16.64) can now be rewritten as

$$l(\boldsymbol{\theta}_i) = \sum_{\boldsymbol{h}_i} \sum_{x_i} s(x_i, \boldsymbol{h}_i) \ln P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i), \tag{16.65}$$

where $s(x_i, \boldsymbol{h}_i)$ is the number of times the specific combination of $(x_i, \boldsymbol{h}_i)$ appeared in the $N$ samples
of the training set. We assume that $N$ is large enough so that all possible combinations occurred at
least once, that is, $s(x_i, \boldsymbol{h}_i) \neq 0$, $\forall (x_i, \boldsymbol{h}_i)$. All one has to do now is to maximize (16.65) with respect
to $P_{x_i|\boldsymbol{h}_i}(\cdot, \cdot)$, taking into account that

$$\sum_{x_i} P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i) = 1.$$

Note that $P_{x_i|\boldsymbol{h}_i}$ are independent for different values of $\boldsymbol{h}_i$. Thus, maximization of (16.65) can take
place separately for each $\boldsymbol{h}_i$, and it is straightforward to see that

$$\hat{P}_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i) = \frac{s(x_i, \boldsymbol{h}_i)}{\sum_{x_i} s(x_i, \boldsymbol{h}_i)}. \tag{16.66}$$
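
For readers who prefer to see the omitted step, one way to arrive at this estimate is via a Lagrange multiplier for the normalization constraint, for a fixed parent configuration. The multiplier $\lambda$ and the cost $J$ are auxiliary symbols introduced here for illustration only.

```latex
% Fixed parent configuration h_i; maximize over P_{x_i|h_i}(x_i, h_i) for all x_i.
J = \sum_{x_i} s(x_i, \boldsymbol{h}_i) \ln P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i)
    + \lambda \Big( 1 - \sum_{x_i} P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i) \Big),
\\[4pt]
\frac{\partial J}{\partial P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i)}
  = \frac{s(x_i, \boldsymbol{h}_i)}{P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i)} - \lambda = 0
  \;\Longrightarrow\;
  P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i) = \frac{s(x_i, \boldsymbol{h}_i)}{\lambda},
\\[4pt]
\sum_{x_i} P_{x_i|\boldsymbol{h}_i}(x_i, \boldsymbol{h}_i) = 1
  \;\Longrightarrow\;
  \lambda = \sum_{x_i} s(x_i, \boldsymbol{h}_i).
```

Because the objective is concave in the probabilities and all counts are assumed nonzero, this stationary point is the maximum.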


In words, the maximum likelihood estimate of the unknown conditional probabilities complies with
our common sense; given a specific combination of the parent values, $\boldsymbol{h}_i$, $P_{x_i|\boldsymbol{h}_i}$ is approximated by
the fraction of times the specific combination $(x_i, \boldsymbol{h}_i)$ appeared in the data set, out of the total number of
times $\boldsymbol{h}_i$ occurred (relate (16.66) to the Viterbi algorithm in Section 16.5.2). One can now see that in
order to obtain good estimates, the number of training points, $N$, should be large enough so that each
combination occurs a sufficiently large number of times. If the average number of parent nodes is large
and/or the number of states is large, this poses heavy demands on the size of the training set. This is
where parameterization of the conditional probabilities can prove to be very helpful.
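
The counting estimate of Eq. (16.66) can be sketched in a few lines. The function `estimate_cpt`, the variable names, and the toy data set below are assumptions made for illustration; parent configurations that never occur in the data simply get no entry, which is the situation the large-$N$ assumption above is meant to exclude.

```python
from collections import Counter, defaultdict

def estimate_cpt(samples, child, parents):
    """ML estimate of P(child | parents) from fully observed data, Eq. (16.66).

    samples: list of dicts mapping variable name -> observed discrete state.
    Returns {parent_configuration: {child_state: probability}}.
    """
    counts = defaultdict(Counter)            # counts[h][x] = s(x, h)
    for row in samples:
        h = tuple(row[p] for p in parents)   # parent configuration h_i
        counts[h][row[child]] += 1
    cpt = {}
    for h, c in counts.items():
        total = sum(c.values())              # sum over x_i of s(x_i, h_i)
        cpt[h] = {x: s / total for x, s in c.items()}
    return cpt

# Hypothetical three-node network: rain -> wet <- sprinkler.
data = [
    {"rain": 1, "sprinkler": 0, "wet": 1},
    {"rain": 1, "sprinkler": 0, "wet": 1},
    {"rain": 0, "sprinkler": 1, "wet": 1},
    {"rain": 0, "sprinkler": 0, "wet": 0},
    {"rain": 0, "sprinkler": 0, "wet": 0},
    {"rain": 0, "sprinkler": 0, "wet": 1},
]
cpt_wet = estimate_cpt(data, child="wet", parents=["rain", "sprinkler"])
```

With only six samples, two of the three observed parent configurations are estimated from one or two occurrences each, which illustrates why the training set must grow quickly with the number of parents and states.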


16.7.2 LEARNING THE STRUCTURE 

In the previous subsection, we considered the structure of the graph to be known and our task was to 
estimate the unknown parameters. We now turn our attention to learning the structure. In general, this 
is a much harder task. We only intend to provide a sketch of some general directions. 

One path, known as constraint-based, is to try to build up a network that satisfies the data
independencies, which are “measured” using different statistical tests on the training data set. The method
relies a lot on intuition and such methods are not particularly popular in practice.

The other path comes under the name of score-based methods. This path treats the task as a typical
model selection problem. The score that is chosen to be maximized provides a tradeoff between model
complexity and accuracy of the fit to the data. Classical model fitting criteria such as the Bayesian
information criterion (BIC) and minimum description length (MDL) have been used, among others. The
main difficulty with all these criteria is that their optimization is an NP-hard task and the issue is to
find appropriate approximate optimization schemes.
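
As an illustration of what a score looks like, the sketch below evaluates one common form of the BIC for a discrete Bayesian network with fully observed data: the maximized log-likelihood minus $(d/2)\ln N$, where $d$ is the number of free parameters. The function name, the data layout, and the toy data are assumptions; the search over candidate structures, which is the hard part, is not addressed.

```python
import math
from collections import Counter

def bic_score(samples, structure, n_states):
    """BIC = max log-likelihood - (d / 2) * ln N for a discrete Bayesian network.

    structure: dict child -> list of parents (assumed to describe a DAG).
    n_states:  dict variable -> number of discrete states.
    """
    N = len(samples)
    loglik, n_params = 0.0, 0
    for child, parents in structure.items():
        joint = Counter((tuple(r[p] for p in parents), r[child]) for r in samples)
        marg = Counter(tuple(r[p] for p in parents) for r in samples)
        # Plug the counting (ML) estimates s(x, h) / sum_x s(x, h) into the log-likelihood.
        loglik += sum(s * math.log(s / marg[h]) for (h, _), s in joint.items())
        q = 1
        for p in parents:
            q *= n_states[p]                 # number of parent configurations
        n_params += (n_states[child] - 1) * q
    return loglik - 0.5 * n_params * math.log(N)

# Toy data in which b is a copy of a.
data = [{"a": 0, "b": 0}] * 10 + [{"a": 1, "b": 1}] * 10
n_states = {"a": 2, "b": 2}
score_indep = bic_score(data, {"a": [], "b": []}, n_states)
score_dep = bic_score(data, {"a": [], "b": ["a"]}, n_states)
```

Here the structure with the edge from `a` to `b` pays a larger complexity penalty but fits the data far better, so it receives the higher score.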

The third main path draws its existence from the Bayesian philosophy. Instead of a single structure, 
an ensemble of structures is employed by embedding appropriate priors into the problem. The readers 
who are interested in a further and deeper study are referred to more specialized books and papers (for 
example, [22,41,53]). 


PROBLEMS 

16.1 Prove that an undirected graph is triangulated if and only if its cliques can be organized into a
join tree.

FIGURE 16.26

The graph for Problem 16.6.

FIGURE 16.27

The Bayesian network structure for Problem 16.7.

16.2 For the graph of Fig. 16.3A, give all possible perfect elimination sequences and draw the
resulting sequence of graphs.

16.3 Derive the formulas for the marginal probabilities of the variables in (a) a clique node and (b) 
in a separator node in a junction tree. 

16.4 Prove that in a junction tree the joint PDF of the variables is given by Eq. (16.8). 

16.5 Show that obtaining the marginal over a single variable is independent of which one from the 
clique/separator nodes that contain the variable the marginalization is performed. 

Hint: Prove it for the case of two neighboring clique nodes in the junction tree. 

16.6 Consider the graph in Fig. 16.26. Obtain a triangulated version of it. 

16.7 Consider the Bayesian network structure given in Fig. 16.27. Obtain an equivalent join tree. 

16.8 Consider the random variables A, B, C, D, E, F, G, H, I, J and assume that the joint distribution
is given by the product of the following potential functions:

$$p = \frac{1}{Z} \psi_1(A, B, C, D)\, \psi_2(B, E, D)\, \psi_3(E, D, F, I)\, \psi_4(C, D, G)\, \psi_5(C, H, G, J).$$

Construct an undirected graphical model on which the previous joint probability factorizes and
subsequently derive an equivalent junction tree.

16.9 Prove that the function

$$g(x) = 1 - \exp(-x), \quad x > 0,$$

is log-concave.

16.10 Derive the conjugate function of

$$f(x) = \ln(1 - \exp(-x)).$$

16.11 Show that minimizing the bound in (16.12) is a convex optimization task.

16.12 Show that the function $1 + \exp(-x)$, $x \in \mathbb{R}$, is log-convex and derive the respective conjugate
one.

16.13 Derive the KL divergence between $P(X^l | X)$ and $Q(X^l)$ for the mean field Boltzmann machine
and obtain the respective $l$ variational parameters.

16.14 Given a distribution in the exponential family 



show that $A(\boldsymbol{\theta})$ generates the respective mean parameters that define the exponential family



Also, show that $A(\boldsymbol{\theta})$ is a convex function.

16.15 Show that the conjugate function of $A(\boldsymbol{\theta})$, associated with an exponential distribution such as
that in Problem 16.14, is the corresponding negative entropy function. Moreover, if $\boldsymbol{\mu}$ and $\boldsymbol{\theta}(\boldsymbol{\mu})$
are doubly coupled, then


$$\boldsymbol{\mu} = E[\boldsymbol{u}(\boldsymbol{x})],$$


where E[-] is with respect to p(x\ 

16.16 Derive a recursion for updating $\gamma(\boldsymbol{x}_n)$ in HMMs independent of $\beta(\boldsymbol{x}_n)$.

16.17 Derive an efficient scheme for prediction in HMM models, that is, to obtain $p(\boldsymbol{y}_{N+1} | Y)$, where

16.18 Prove the estimation formulas for the probabilities $P_k$, $k = 1, 2, \ldots, K$, and $P_{ij}$, $i, j =
1, 2, \ldots, K$, in the context of the forward-backward algorithm for training HMMs.

16.19 Consider the Gaussian Bayesian network of Section 15.3.5 defined by the local conditional
PDFs

$$p(x_i | \mathrm{Pa}_i) = \mathcal{N}\Big( x_i \,\Big|\, \sum_{k:\, x_k \in \mathrm{Pa}_i} \theta_{ik} x_k + \theta_{i0},\ \sigma^2 \Big), \quad i = 1, 2, \ldots, l.$$

Assume a set of $N$ observations, $x_i(n)$, $n = 1, 2, \ldots, N$, $i = 1, 2, \ldots, l$, and derive a maximum
likelihood estimate of the parameters $\boldsymbol{\theta}$; assume the common variance $\sigma^2$ to be known.


17.1 INTRODUCTION

This chapter is a follow-up to Chapter 14, whose focus was on Monte Carlo methods. Our interest now 
turns to a special type of sampling techniques known as sequential-sampling methods. In contrast to 
the Monte Carlo methods, considered in Chapter 14, here we will assume that distributions from which 
we want to sample are time-varying, and that sampling will take place in a sequential fashion. The 
main emphasis of this chapter is on particle filtering techniques for inference in state-space dynamic 
models. In contrast to the classical form of Kalman filtering, here the model is allowed to be nonlinear
and/or the distributions associated with the involved variables are allowed to be non-Gaussian.


17.2 SEQUENTIAL IMPORTANCE SAMPLING 

Our interest in this section shifts toward tasks where data are sequentially arriving, and our goal
becomes that of sampling from their joint distribution. In other words, we are receiving observations
$\boldsymbol{x}_n \in \mathbb{R}^l$ of random vectors. At some time $n$, let $\boldsymbol{x}_{1:n} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$ denote the set of the available
samples and let the respective joint distribution be denoted as $p_n(\boldsymbol{x}_{1:n})$. No doubt, this new task is to be
treated with special care. Not only is the dimensionality of the task (number of random variables, i.e.,
$\boldsymbol{x}_{1:n}$) now time-varying, but also, after some time has elapsed, the dimensionality will be very large and,
in general, we expect the corresponding distribution, $p_n(\boldsymbol{x}_{1:n})$, to be of a rather complex form.
Moreover, at time instant $n$, even if we knew how to sample from $p_n(\boldsymbol{x}_{1:n})$, the required time for sampling
would be at least of the order of $n$. Hence, even in such a case, sampling would not be computationally
feasible for large values of $n$. Sequential sampling is of particular interest in dynamic systems in the
context of particle filtering, and we will come to deal with such systems very soon. Our discussion will
develop around the importance sampling method, which was introduced in Section 14.5.

17.2.1 IMPORTANCE SAMPLING REVISITED 

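Recall from Eq. (14.28) that given (a) a function $f(\boldsymbol{x})$, (b) a desired distribution $p(\boldsymbol{x}) = \frac{1}{Z}\phi(\boldsymbol{x})$, and
(c) a proposal distribution $q(\boldsymbol{x})$, we have

$$E[f(\boldsymbol{x})] := \mu_f \simeq \sum_{i=1}^{N} W(\boldsymbol{x}_i) f(\boldsymbol{x}_i) := \hat{\mu}, \tag{17.1}$$

where $\boldsymbol{x}_i$ are samples drawn from $q(\boldsymbol{x})$. Recall, also, that the estimate

$$\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} w(\boldsymbol{x}_i) \tag{17.2}$$

defines an unbiased estimator of the true normalizing constant $Z$, where $w(\boldsymbol{x}_i)$ are the nonnormalized
weights, $w(\boldsymbol{x}_i) = \phi(\boldsymbol{x}_i)/q(\boldsymbol{x}_i)$. Note that the approximation in Eq. (17.1) equivalently implies the following
approximation of the desired distribution:

$$p(\boldsymbol{x}) \simeq \sum_{i=1}^{N} W(\boldsymbol{x}_i)\, \delta(\boldsymbol{x} - \boldsymbol{x}_i): \text{ discrete random measure approximation.} \tag{17.3}$$

In other words, even a continuous PDF is approximated by a set of discrete points and weights assigned
to them. We say that the distribution is approximated by a discrete random measure defined by the
particles $\boldsymbol{x}_i$, $i = 1, 2, \ldots, N$, with respective normalized weights $W^{(i)} := W(\boldsymbol{x}_i)$. The approximating
random measure is denoted as $\chi = \{\boldsymbol{x}_i, W^{(i)}\}_{i=1}^{N}$.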

Also, we have already commented in Section 14.5 that a major drawback of importance sampling 
is the large variance of the weights, which becomes more severe in high-dimensional spaces, where 
our interest will be from now on. Let us elaborate on this variance problem a bit more and seek ways 
to bypass/reduce this undesired behavior. 

It can be shown (e.g., [33] and Problem 17.1) that the variance of the corresponding estimator, $\hat{\mu}$,
in Eq. (17.1) is given by

$$\mathrm{var}[\hat{\mu}] = \frac{1}{N} \left( \int \frac{f^2(\boldsymbol{x})\, p^2(\boldsymbol{x})}{q(\boldsymbol{x})}\, d\boldsymbol{x} - \mu_f^2 \right). \tag{17.4}$$

Observe that if the numerator $f^2(\boldsymbol{x}) p^2(\boldsymbol{x})$ tends to zero slower than $q(\boldsymbol{x})$ does, then for fixed $N$,
the variance $\mathrm{var}[\hat{\mu}] \rightarrow \infty$. This demonstrates the significance of selecting $q$ very carefully. It is not
difficult to see, by minimizing Eq. (17.4), that the optimal choice for $q(\boldsymbol{x})$, leading to the minimum
(zero) variance, is proportional to the product $f(\boldsymbol{x})p(\boldsymbol{x})$. We will make use of this result later on. Note,
of course, that the proportionality constant is $1/\mu_f$, which is not known. Thus, this result can only be
considered as a benchmark.

Concerning the variance issue, let us turn our attention to the unbiased estimator $\hat{Z}$ of $Z$ in
Eq. (17.2). It can be shown (Problem 17.2) that

$$\mathrm{var}[\hat{Z}] = \frac{Z^2}{N} \left( \int \frac{p^2(\boldsymbol{x})}{q(\boldsymbol{x})}\, d\boldsymbol{x} - 1 \right). \tag{17.5}$$

By its definition, the variance of $\hat{Z}$ is directly related to the variance of the weights. It turns out that, in
practice, the variance in Eq. (17.5) exhibits an exponential dependence on the dimensionality (e.g., [11,
15] and Problem 17.5). In such cases, the number of samples, $N$, has to be excessively large in order to
keep the variance relatively small. One way to partially cope with the variance-related problem is the
resampling technique.
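
The estimators in Eqs. (17.1) and (17.2) can be tried out on a one-dimensional toy problem. In the sketch below, the unnormalized target $\phi(x) = \exp(-x^2/2)$ (so that $Z = \sqrt{2\pi}$), the wider Gaussian proposal, the sample size, and all function names are assumptions chosen purely for illustration.

```python
import math
import random

def importance_sampling(phi, f, q_sample, q_pdf, n, rng):
    """Return (mu_hat, z_hat) as in Eqs. (17.1) and (17.2)."""
    xs = [q_sample(rng) for _ in range(n)]
    w = [phi(x) / q_pdf(x) for x in xs]      # nonnormalized weights w(x_i)
    z_hat = sum(w) / n                        # Eq. (17.2)
    total = sum(w)
    # Eq. (17.1) with normalized weights W(x_i) = w(x_i) / sum_j w(x_j).
    mu_hat = sum(wi / total * f(x) for wi, x in zip(w, xs))
    return mu_hat, z_hat

SIGMA_Q = 2.0  # proposal standard deviation (arbitrary illustrative choice)

def phi(x):
    return math.exp(-0.5 * x * x)            # unnormalized N(0, 1)

def q_pdf(x):
    return math.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * math.sqrt(2.0 * math.pi))

def q_sample(rng):
    return rng.gauss(0.0, SIGMA_Q)

# Estimate E[x^2] (true value 1) and Z (true value sqrt(2*pi)).
mu_hat, z_hat = importance_sampling(phi, lambda x: x * x, q_sample, q_pdf,
                                    20000, random.Random(0))
```

Choosing a proposal with heavier tails than the target keeps the ratio $p^2/q$ integrable, in line with the remark following Eq. (17.4); a proposal narrower than the target would make the weights far more variable.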


17.2.2 RESAMPLING 

Resampling is a very intuitive approach where one attempts a randomized pruning of the available
samples (particles), drawn from $q$, by (most likely) discarding those associated with low weights
and replacing them with samples whose weights have larger values. This is achieved by drawing
samples from the approximation of $p(\boldsymbol{x})$, denoted as $\hat{p}(\boldsymbol{x})$, which is based on the discrete random
measure $\{\boldsymbol{x}_i, W^{(i)}\}_{i=1}^{N}$ in Eq. (17.3). In importance sampling, the involved particles are drawn from
$q(\boldsymbol{x})$ and the weights are appropriately computed in order to “match” the desired distribution. Adding
the extra step of resampling, a new set of unweighted samples is drawn from the discrete
approximation $\hat{p}$ of $p$. Using the resampling step, we still obtain samples that are approximately distributed as $p$;
moreover, particles of low weight have been removed with high probability and thereby, for the next
time instant, the probability of exploring regions with larger probability masses is increased. There are
different ways of sampling from a discrete distribution.

• Multinomial resampling. This method is equivalent to the one presented in Example 14.2. Each
particle $\boldsymbol{x}_i$ is associated with a probability $P_i = W^{(i)}$. Redrawing $N$ (new) particles will generate from
each particle, $\boldsymbol{x}_i$, $N^{(i)}$ “offsprings” ($\sum_{i=1}^{N} N^{(i)} = N$), depending on their respective probability $P_i$.
Hence, $N^{(1)}, \ldots, N^{(N)}$ will follow a multinomial distribution (Section 2.3), that is,

$$P(N^{(1)}, \ldots, N^{(N)}) = \binom{N}{N^{(1)} \cdots N^{(N)}} \prod_{i=1}^{N} P_i^{N^{(i)}}.$$

In this way, the higher the probability (weight $W^{(i)}$) of an originally drawn particle, the higher the
number of times, $N^{(i)}$, this particle will be redrawn.

The new discrete estimate of the desired distribution will now be given by

$$\bar{p}(\boldsymbol{x}) = \sum_{i=1}^{N} \frac{N^{(i)}}{N}\, \delta(\boldsymbol{x} - \boldsymbol{x}_i).$$

From the properties of the multinomial distribution, we have $E[N^{(i)}] = N P_i = N W^{(i)}$, and hence
$\bar{p}(\boldsymbol{x})$ is an unbiased approximation of $\hat{p}(\boldsymbol{x})$.

• Systematic resampling. Systematic resampling is a variant of the multinomial approach. Recall from
Example 14.2 that every time a particle is to be (re)drawn, a new sample is generated from the
uniform distribution $\mathcal{U}(0, 1)$. In contrast, in systematic resampling, the process is not entirely random.
To generate $N$ particles, we only select randomly one sample, $u^{(1)} \sim \mathcal{U}\left(0, \frac{1}{N}\right)$. Then define

$$u^{(j)} = u^{(1)} + \frac{j - 1}{N}, \quad j = 2, 3, \ldots, N,$$

and set

$$N^{(i)} = \mathrm{card}\left\{ \text{all } j : \sum_{k=1}^{i-1} W^{(k)} < u^{(j)} \leq \sum_{k=1}^{i} W^{(k)} \right\},$$

where $\mathrm{card}\{\cdot\}$ denotes the cardinality of the respective set. Fig. 17.1 illustrates the method. The
resampling algorithm is summarized next.

FIGURE 17.1

The sample $u^{(1)}$ drawn from $\mathcal{U}\left(0, \frac{1}{N}\right)$ determines the first point that defines the set of $N$ equidistant lines, which
are drawn and cut across the cumulative distribution function (CDF). The respective intersections determine the
number of times, $N^{(i)}$, the corresponding particle, $\boldsymbol{x}_i$, will be represented in the set. For the case of the figure, $\boldsymbol{x}_1$
will appear once, $\boldsymbol{x}_2$ is missed, and $\boldsymbol{x}_3$ appears two times.


Algorithm 17.1 (Resampling).

• Initialization

- Input the samples $\boldsymbol{x}_i$ and respective weights $W^{(i)}$, $i = 1, 2, \ldots, N$.

- $c_0 = 0$, $N^{(i)} = 0$, $i = 1, 2, \ldots, N$.

• For $i = 1, 2, \ldots, N$, Do

- $c_i = c_{i-1} + W^{(i)}$; Construct CDF.

• End For

• Draw $u^{(1)} \sim \mathcal{U}\left(0, \frac{1}{N}\right)$

• $i = 1$

• For $j = 1, 2, \ldots, N$, Do

- $u^{(j)} = u^{(1)} + \frac{j-1}{N}$

- While $u^{(j)} > c_i$
* $i = i + 1$

- End While

- $\bar{\boldsymbol{x}}_j = \boldsymbol{x}_i$; Assign sample.

- $N^{(i)} = N^{(i)} + 1$

• End For

The output comprises the new samples $\bar{\boldsymbol{x}}_j$, $j = 1, 2, \ldots, N$, and all the weights are set equal to $\frac{1}{N}$.
The sample $\boldsymbol{x}_i$ will now appear, after resampling, $N^{(i)}$ times.

The two previously stated resampling methods are not the only possibilities (see, e.g., [12]).
However, systematic resampling is the one that is usually adopted, mainly due to its easy
implementation. Systematic resampling was introduced in [25].

Resampling schemes result in estimates that converge to their true values, as long as the number of
particles tends to infinity (Problem 17.3).
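
Both resampling schemes can be sketched compactly. The code below follows the steps of Algorithm 17.1 for the systematic variant and uses the standard library's weighted sampling for the multinomial one; it returns the offspring counts $N^{(i)}$ rather than the resampled particles. The function names and the example weights are assumptions, and the two small guards against floating-point round-off are additions that do not appear in the algorithm as stated.

```python
import random

def systematic_resample(weights, rng):
    """Offspring counts N^(i) via systematic resampling (Algorithm 17.1).

    weights: normalized weights W^(i), assumed to sum to one.
    """
    n = len(weights)
    cdf, c = [], 0.0
    for w in weights:                 # construct the CDF c_1, ..., c_N
        c += w
        cdf.append(c)
    cdf[-1] = 1.0                     # guard against accumulated round-off
    u1 = rng.uniform(0.0, 1.0 / n)    # the single random draw u^(1)
    counts = [0] * n
    i = 0
    for j in range(n):
        u = u1 + j / n                # equidistant points u^(j)
        while i < n - 1 and u > cdf[i]:
            i += 1
        counts[i] += 1                # particle i is selected once more
    return counts

def multinomial_resample(weights, rng):
    """Offspring counts N^(i) via N independent draws with probabilities W^(i)."""
    n = len(weights)
    counts = [0] * n
    for i in rng.choices(range(n), weights=weights, k=n):
        counts[i] += 1
    return counts

W = [0.1, 0.2, 0.3, 0.4]
counts_sys = systematic_resample(W, random.Random(1))
counts_mult = multinomial_resample(W, random.Random(2))
```

In the systematic scheme each count can only be the integer just below or just above $N W^{(i)}$, whereas multinomial counts fluctuate more widely around the same mean; this lower variability is one reason the systematic variant is popular in addition to its simplicity.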

17.2.3 SEQUENTIAL SAMPLING 

Let us now apply the experience we have gained in importance sampling to the case of sequentially
arriving particles. The first examples of such techniques date back to the 1950s (e.g., [19,36]). At
time $n$, our interest is to draw samples from the joint distribution

$$p_n(\boldsymbol{x}_{1:n}) = \frac{\phi_n(\boldsymbol{x}_{1:n})}{Z_n}, \tag{17.6}$$

based on a proposal distribution $q_n(\boldsymbol{x}_{1:n})$, where $Z_n$ is the normalizing constant at time $n$. However,
we are going to set the same goal that we have adopted for any time-recursive setting throughout
this book, that is, to keep computational complexity fixed, independent of the time instant, $n$. Such
a rationale dictates a time-recursive computation of the involved quantities. To this end, we select a
proposal distribution of the form

$$q_n(\boldsymbol{x}_{1:n}) = q_{n-1}(\boldsymbol{x}_{1:n-1})\, q_n(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1}). \tag{17.7}$$

From Eq. (17.7), it is readily seen that

$$q_n(\boldsymbol{x}_{1:n}) = q_1(\boldsymbol{x}_1) \prod_{k=2}^{n} q_k(\boldsymbol{x}_k | \boldsymbol{x}_{1:k-1}). \tag{17.8}$$

This means that one has only to choose $q_k(\boldsymbol{x}_k | \boldsymbol{x}_{1:k-1})$, $k = 2, 3, \ldots, n$, together with the initial (prior)
$q_1(\boldsymbol{x}_1)$. Note that the dimensionality of the involved random vector in $q_k(\cdot|\cdot)$, given the past, remains
fixed for all time instants. Eq. (17.8), viewed from another angle, reveals that in order to draw a single
(multivariate) sample that spans the time interval up to time $n$, that is, $\boldsymbol{x}_{1:n}^{(i)} = \{\boldsymbol{x}_1^{(i)}, \boldsymbol{x}_2^{(i)}, \ldots, \boldsymbol{x}_n^{(i)}\}$, we
build it up recursively; we first draw $\boldsymbol{x}_1^{(i)} \sim q_1(\boldsymbol{x})$ and then draw $\boldsymbol{x}_k^{(i)} \sim q_k(\boldsymbol{x} | \boldsymbol{x}_{1:k-1}^{(i)})$, $k = 2, 3, \ldots, n$.

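The corresponding nonnormalized weights are also computed recursively [15]. Indeed,

$$\begin{aligned}
w_n(\boldsymbol{x}_{1:n}) &:= \frac{\phi_n(\boldsymbol{x}_{1:n})}{q_n(\boldsymbol{x}_{1:n})} = \frac{\phi_{n-1}(\boldsymbol{x}_{1:n-1})}{q_n(\boldsymbol{x}_{1:n})}\, \frac{\phi_n(\boldsymbol{x}_{1:n})}{\phi_{n-1}(\boldsymbol{x}_{1:n-1})} \\
&= \frac{\phi_{n-1}(\boldsymbol{x}_{1:n-1})}{q_{n-1}(\boldsymbol{x}_{1:n-1})}\, \frac{\phi_n(\boldsymbol{x}_{1:n})}{\phi_{n-1}(\boldsymbol{x}_{1:n-1})\, q_n(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1})} \\
&= w_{n-1}(\boldsymbol{x}_{1:n-1})\, \alpha_n(\boldsymbol{x}_{1:n}) \qquad\qquad (17.9) \\
&= w_1(\boldsymbol{x}_1) \prod_{k=2}^{n} \alpha_k(\boldsymbol{x}_{1:k}),
\end{aligned}$$

where

$$\alpha_k(\boldsymbol{x}_{1:k}) := \frac{\phi_k(\boldsymbol{x}_{1:k})}{\phi_{k-1}(\boldsymbol{x}_{1:k-1})\, q_k(\boldsymbol{x}_k | \boldsymbol{x}_{1:k-1})}, \quad k = 2, 3, \ldots, n. \tag{17.10}$$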

The question that is now raised is how to choose $q_n(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1})$, $n = 2, 3, \ldots$. A sensible strategy is
to select it in order to minimize the variance of the weight $w_n(\boldsymbol{x}_{1:n})$, given the samples $\boldsymbol{x}_{1:n-1}$. It turns
out that the optimal choice, which actually makes the variance zero (Problem 17.4), is given by

$$q_n^{\mathrm{opt}}(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1}) = p_n(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1}): \text{ optimal proposal distribution.} \tag{17.11}$$

However, most often in practice, $p_n(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1})$ is not easy to sample and one has to be content with
adopting some approximation of it. We are now ready to state our first algorithm for sequential
importance sampling (SIS).

Algorithm 17.2 (Sequential importance sampling).

• Select $q_1(\cdot)$, $q_n(\cdot|\cdot)$, $n = 2, 3, \ldots$

• Select number of particles, $N$.

• For $i = 1, 2, \ldots, N$, Do; Initialize $N$ different realizations/streams.

- Draw $\boldsymbol{x}_1^{(i)} \sim q_1(\boldsymbol{x})$

- Compute the weights $w_1(\boldsymbol{x}_1^{(i)}) = \frac{\phi_1(\boldsymbol{x}_1^{(i)})}{q_1(\boldsymbol{x}_1^{(i)})}$

• End For

• For $i = 1, 2, \ldots, N$, Do

- Compute the normalized weights $W_1^{(i)}$.

• End For

• For $n = 2, 3, \ldots$, Do

- For $i = 1, 2, \ldots, N$, Do

* Draw $\boldsymbol{x}_n^{(i)} \sim q_n(\boldsymbol{x} | \boldsymbol{x}_{1:n-1}^{(i)})$

* Compute the weights
$w_n(\boldsymbol{x}_{1:n}^{(i)}) = w_{n-1}(\boldsymbol{x}_{1:n-1}^{(i)})\, \alpha_n(\boldsymbol{x}_{1:n}^{(i)})$; from Eq. (17.9).

- End For

- For $i = 1, 2, \ldots, N$, Do
* $W_n^{(i)} \propto w_n(\boldsymbol{x}_{1:n}^{(i)})$

- End For

• End For

Once the algorithm has been completed, we can write

$$\hat{p}_n(\boldsymbol{x}_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\, \delta(\boldsymbol{x}_{1:n} - \boldsymbol{x}_{1:n}^{(i)}).$$

However, as we have already said, the variance of the weights has the tendency to increase with $n$
(see Problem 17.5). Thus, the resampling version of the sequential importance sampling is usually
employed.

Algorithm 17.3 (SIS with resampling).

• Select $q_n(\cdot|\cdot)$, $n = 1, 2, \ldots$

• Select number of particles $N$.

• For $i = 1, 2, \ldots, N$, Do

- Draw $\boldsymbol{x}_1^{(i)} \sim q_1(\boldsymbol{x})$

- Compute the weights $w_1(\boldsymbol{x}_1^{(i)}) = \frac{\phi_1(\boldsymbol{x}_1^{(i)})}{q_1(\boldsymbol{x}_1^{(i)})}$.

• End For

• For $i = 1, 2, \ldots, N$, Do

- Compute the normalized weights $W_1^{(i)}$

• End For

• Resample $\{\boldsymbol{x}_1^{(i)}, W_1^{(i)}\}_{i=1}^{N}$ to obtain $\{\bar{\boldsymbol{x}}_1^{(i)}, \frac{1}{N}\}_{i=1}^{N}$ using Algorithm 17.1

• For $n = 2, 3, \ldots$, Do

- For $i = 1, 2, \ldots, N$, Do

* Draw $\boldsymbol{x}_n^{(i)} \sim q_n(\boldsymbol{x} | \bar{\boldsymbol{x}}_{1:n-1}^{(i)})$

* Set $\boldsymbol{x}_{1:n}^{(i)} = \{\boldsymbol{x}_n^{(i)}, \bar{\boldsymbol{x}}_{1:n-1}^{(i)}\}$

* Compute $w_n(\boldsymbol{x}_{1:n}^{(i)}) = \frac{1}{N}\, \alpha_n(\boldsymbol{x}_{1:n}^{(i)})$; (Eq. (17.9)).

- End For

- For $i = 1, 2, \ldots, N$, Do

* Compute $W_n^{(i)}$

- End For

- Resample $\{\boldsymbol{x}_{1:n}^{(i)}, W_n^{(i)}\}_{i=1}^{N}$ to obtain $\{\bar{\boldsymbol{x}}_{1:n}^{(i)}, \frac{1}{N}\}_{i=1}^{N}$

• End For

Remarks 17.1.

• Convergence results concerning sequential importance sampling can be found in, for example,
[5-7]. It turns out that, in practice, the use of resampling leads to substantially smaller variances.

• From a practical point of view, sequential importance methods with resampling are expected to
work reasonably well if the desired successive distributions at different time instants do not differ
much and the choice of $q_n(\boldsymbol{x}_n | \boldsymbol{x}_{1:n-1})$ is close to the optimal one (see, e.g., [15]).
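
The flow of SIS with resampling can be sketched on a scalar toy target. All modeling choices below are assumptions made only for illustration: the unnormalized path density is $\phi_n(x_{1:n}) = \exp(-x_1^2/2)\prod_{k=2}^{n}\exp(-(x_k - a x_{k-1})^2/2)$, the proposals are Gaussians centered at $a x_{k-1}$ with a larger standard deviation $s$, and the function names are hypothetical. The sketch resamples at the start of each iteration instead of at the end, so the particles returned after the last step are still weighted.

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def systematic_indices(weights, rng):
    """Indices of the particles selected by systematic resampling."""
    n = len(weights)
    u1 = rng.uniform(0.0, 1.0 / n)
    idx, i, c = [], 0, weights[0]
    for j in range(n):
        u = u1 + j / n
        while u > c and i < n - 1:
            i += 1
            c += weights[i]
        idx.append(i)
    return idx

def sis_with_resampling(n_steps, n_particles, a=0.9, s=1.5, seed=0):
    rng = random.Random(seed)
    # n = 1: draw from q_1 = N(0, s^2), weight by phi_1 / q_1.
    paths = [[rng.gauss(0.0, s)] for _ in range(n_particles)]
    w = [math.exp(-0.5 * p[0] ** 2) / normal_pdf(p[0], 0.0, s) for p in paths]
    for n in range(2, n_steps + 1):
        total = sum(w)
        W = [wi / total for wi in w]                     # normalized weights
        paths = [list(paths[i]) for i in systematic_indices(W, rng)]  # resample
        w = []
        for p in paths:
            mean = a * p[-1]
            x = rng.gauss(mean, s)                       # draw from q_n(x | x_{n-1})
            # alpha_n = phi_n / (phi_{n-1} * q_n), Eq. (17.10)
            alpha = math.exp(-0.5 * (x - mean) ** 2) / normal_pdf(x, mean, s)
            p.append(x)
            w.append(alpha / n_particles)                # w_n = (1/N) * alpha_n
    total = sum(w)
    return paths, [wi / total for wi in w]

paths, W = sis_with_resampling(n_steps=5, n_particles=5000)
second_moment = sum(Wi * p[-1] ** 2 for Wi, p in zip(W, paths))
```

For this toy target the last coordinate follows a first-order autoregression, whose variance obeys $\sigma_n^2 = a^2\sigma_{n-1}^2 + 1$ with $\sigma_1^2 = 1$, so `second_moment` can be compared with the value of about 3.43 at $n = 5$.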


17.3 KALMAN AND PARTICLE FILTERING 

Particle filtering is an instance of the sequential Monte Carlo methods. Particle filtering is a technique 
born in the 1990s and it was first introduced in [18] as an attempt to solve estimation tasks in the 
context of state-space modeling for the more general nonlinear and non-Gaussian scenarios. The term 
“particle filtering” was coined in [3], although the term “particle” had been used in [25]. 

Hidden Markov models (HMMs), which are treated in Section 16.4, and Kalman filters, treated in 
Chapter 4, are special types of state-space (state-observation) modeling. The former address the case 
of discrete state (latent) variables and the latter the continuous case, albeit in the very special case of 
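linear and Gaussian scenario. In particle filtering, the interest shifts to models of the following form:

$$\boldsymbol{x}_n = \boldsymbol{f}_n(\boldsymbol{x}_{n-1}, \boldsymbol{\eta}_n): \text{ state equation,} \tag{17.12}$$

$$\boldsymbol{y}_n = \boldsymbol{h}_n(\boldsymbol{x}_n, \boldsymbol{v}_n): \text{ observations equation,} \tag{17.13}$$

where $\boldsymbol{f}_n$ and $\boldsymbol{h}_n$ are nonlinear, in general, (vector) functions, $\boldsymbol{\eta}_n$ and $\boldsymbol{v}_n$ are noise sequences, and the
dimensions of $\boldsymbol{x}_n$ and $\boldsymbol{y}_n$ can be different. The random vector $\boldsymbol{x}_n$ is the (latent) state vector and $\boldsymbol{y}_n$
corresponds to the observations. There are two inference tasks that are of interest in practice.

Filtering: Given the set of observations, $\boldsymbol{y}_{1:n}$, in the time interval $[1, n]$, compute

$$p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n}).$$

Smoothing: Given the set of observations $\boldsymbol{y}_{1:N}$ in a time interval $[1, N]$, compute

$$p(\boldsymbol{x}_n | \boldsymbol{y}_{1:N}), \quad 1 \le n \le N.$$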

Before we proceed to our main goal, let us review the simpler case, that of Kalman filters, this time 
from a Bayesian viewpoint. 

17.3.1 KALMAN FILTERING: A BAYESIAN POINT OF VIEW 

Kalman filtering was discussed in Section 4.10 in the context of linear estimation methods and the 
mean-square error criterion. In the current section, the Kalman filtering algorithm will be rederived 
following concepts from the theory of graphical models and Bayesian networks, which are treated 
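in Chapters 15 and 16. This probabilistic view will then be used for the subsequent nonlinear
generalizations in the framework of particle filtering. For the linear case model, Eqs. (17.12) and (17.13)
become

$$\boldsymbol{x}_n = F_n \boldsymbol{x}_{n-1} + \boldsymbol{\eta}_n, \tag{17.14}$$

$$\boldsymbol{y}_n = H_n \boldsymbol{x}_n + \boldsymbol{v}_n, \tag{17.15}$$

where $F_n$ and $H_n$ are matrices of appropriate dimensions. We further assume that the two noise
sequences are statistically independent and of a Gaussian nature, that is,

$$p(\boldsymbol{\eta}_n) = \mathcal{N}(\boldsymbol{\eta}_n | \boldsymbol{0}, Q_n), \tag{17.16}$$

$$p(\boldsymbol{v}_n) = \mathcal{N}(\boldsymbol{v}_n | \boldsymbol{0}, R_n). \tag{17.17}$$

FIGURE 17.2

Graphical model corresponding to the state-space modeling for Kalman and particle filters.

The kick-off point for deriving the associated recursions is the Bayes rule,

$$p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n}) = \frac{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{y}_{1:n-1})\, p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n-1})}{Z_n} = \frac{p(\boldsymbol{y}_n | \boldsymbol{x}_n)\, p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n-1})}{Z_n}, \tag{17.18}$$

where

$$Z_n = \int p(\boldsymbol{y}_n | \boldsymbol{x}_n)\, p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n-1})\, d\boldsymbol{x}_n = p(\boldsymbol{y}_n | \boldsymbol{y}_{1:n-1}), \tag{17.19}$$

and we have used the fact that $p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{y}_{1:n-1}) = p(\boldsymbol{y}_n | \boldsymbol{x}_n)$, which is a consequence of Eq. (17.15).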

For those who have already read Chapter 15, recall that Kalman filtering is a special case of a Bayesian
network and corresponds to the graphical model given in Fig. 17.2. Hence, due to the Markov property,
$\boldsymbol{y}_n$ is independent of the past given the values in $\boldsymbol{x}_n$. Moreover, note that

$$\begin{aligned}
p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n-1}) &= \int p(\boldsymbol{x}_n | \boldsymbol{x}_{n-1}, \boldsymbol{y}_{1:n-1})\, p(\boldsymbol{x}_{n-1} | \boldsymbol{y}_{1:n-1})\, d\boldsymbol{x}_{n-1} \\
&= \int p(\boldsymbol{x}_n | \boldsymbol{x}_{n-1})\, p(\boldsymbol{x}_{n-1} | \boldsymbol{y}_{1:n-1})\, d\boldsymbol{x}_{n-1}, \qquad\qquad (17.20)
\end{aligned}$$

where, once more, the Markov property (i.e., Eq. (17.14)) has been used.

Eqs. (17.18)-(17.20) comprise the set of recursions which lead to the update

$$p(\boldsymbol{x}_{n-1} | \boldsymbol{y}_{1:n-1}) \rightarrow p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n}),$$

starting from the initial (prior) $p(\boldsymbol{x}_0 | \boldsymbol{y}_0) := p(\boldsymbol{x}_0)$. If $p(\boldsymbol{x}_0)$ is chosen to be Gaussian, then all the
involved PDFs turn out to be Gaussian due to Eqs. (17.16) and (17.17) and the linearity of Eqs. (17.14)
and (17.15); this makes the computation of the integrals a trivial task following the recipe rules in the
Appendix of Chapter 12.

Before we proceed further, note that the recursions in Eqs. (17.18) and (17.20) are an instance of 
the sum-product algorithm for graphical models. Indeed, to put our current discussion in this context, 
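let us compactly write the previous recursions as

$$p(\boldsymbol{x}_n | \boldsymbol{y}_{1:n}) = \underbrace{\frac{p(\boldsymbol{y}_n | \boldsymbol{x}_n)}{Z_n}}_{\text{corrector}}\ \underbrace{\int p(\boldsymbol{x}_{n-1} | \boldsymbol{y}_{1:n-1})\, p(\boldsymbol{x}_n | \boldsymbol{x}_{n-1})\, d\boldsymbol{x}_{n-1}}_{\text{predictor}}: \text{ filtering.} \tag{17.21}$$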

Note that this is of exactly the same form, within the normalizing factor, as Eq. (16.43) of Chapter 16; 
just replace summation with integration. One can rederive Eq. (17.21) using the sum-product rule, 
following similar steps as for Eq. (16.43). The only difference is that the normalizing constant has to 
be involved in all respective definitions and we replace summations with integrations. Because all the 
involved PDFs are Gaussians, the computation of the involved normalizing constants is trivially done; 
moreover, it suffices to derive recursions only for the respective mean values and covariances. 

In Eq. (17.20), we have 

p(x„lx„-i) =Af(x n \F n x n -i, Q„). 

Let, also, p(x n -i\yi- n _i ) be Gaussian with mean and covariance matrix 

Pn—l |/i — 1 ’ Pn — \\n—\ i 

respectively, where the notation is chosen for the derived recursions to comply with the algorithm given 
in Section 4.10. Then, according to the Appendix of Chapter 12, p(x n is a Gaussian marginal 

PDF with mean and covariance given by (see Eqs. (12.150) and (12.151)) 

Pn\n-\ = FnP-n-Hn-l’ (17.22) 

Pn\n-\ = Qn + F n P„_ i|„_i fj. (17.23) 

Also, in Eq. (17.18) we have

$$p(y_n|x_n) = \mathcal{N}(y_n|H_n x_n, R_n).$$

From the Appendix of Chapter 12, and taking into account Eqs. (17.22) and (17.23), we find that
$p(x_n|y_{1:n})$ is the posterior (Gaussian) with mean and covariance given by (see Eqs. (12.148) and
(12.149))

$$\mu_{n|n} = \mu_{n|n-1} + K_n (y_n - H_n \mu_{n|n-1}), \tag{17.24}$$

$$P_{n|n} = P_{n|n-1} - K_n H_n P_{n|n-1}, \tag{17.25}$$

where

$$K_n = P_{n|n-1} H_n^T S_n^{-1}, \tag{17.26}$$

and

$$S_n = R_n + H_n P_{n|n-1} H_n^T. \tag{17.27}$$

Note that these are exactly the same recursions that were derived in Section 4.10 for the state estima- 
tion; recall that under the Gaussian assumption, the posterior mean coincides with the least-squares 
estimate. 

Here we have assumed that the matrices $F_n$, $H_n$, as well as the covariance matrices, are known. This is
most often the case. If not, these can be learned using arguments similar to those used in learning the
HMM parameters, which are discussed in Section 16.5.2 (see, e.g., [2]).
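
The predictor and corrector recursions for the mean and covariance, $\mu_{n|n-1} = F_n\mu_{n-1|n-1}$, $P_{n|n-1} = Q_n + F_nP_{n-1|n-1}F_n^T$, followed by the gain-based update, can be sketched in a few lines. This is only an illustrative sketch in Python with NumPy; the function name `kalman_step` and its argument order are assumptions made here and do not appear in the text.

```python
import numpy as np

def kalman_step(mu_prev, P_prev, y, F, H, Q, R):
    """One predictor/corrector recursion, following Eqs. (17.22)-(17.27).

    mu_prev, P_prev: mean and covariance of p(x_{n-1} | y_{1:n-1}).
    Returns the mean and covariance of p(x_n | y_{1:n}).
    """
    # Predictor: Eqs. (17.22) and (17.23)
    mu_pred = F @ mu_prev
    P_pred = Q + F @ P_prev @ F.T
    # Innovation covariance, Eq. (17.27)
    S = R + H @ P_pred @ H.T
    # Gain K = P_pred H^T S^{-1}, Eq. (17.26); solve instead of inverting S
    K = np.linalg.solve(S, H @ P_pred).T
    # Corrector: Eqs. (17.24) and (17.25)
    mu_post = mu_pred + K @ (y - H @ mu_pred)
    P_post = P_pred - K @ H @ P_pred
    return mu_post, P_post

# Scalar illustration with F = H = Q = R = 1 (hypothetical values)
I1 = np.eye(1)
mu, P = kalman_step(np.array([0.0]), I1, np.array([2.0]), I1, I1, I1, I1)
```

The gain is obtained from a linear solve, which relies on $S_n$ and $P_{n|n-1}$ being symmetric; this is a numerical convenience and not part of the derivation.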


17.4 PARTICLE FILTERING 

In Section 4.10, extended Kalman filtering (EKF) was discussed as one possibility to generalize
Kalman filtering to nonlinear models. Particle filtering, to be discussed next, is a powerful alternative
technique to EKF. The involved PDFs are approximated by discrete random measures. The underlying
theory is that of sequential importance sampling (SIS); as a matter of fact, particle filtering is an
instance of SIS.

Let us now consider the state-space model of the general form in Eqs. (17.12) and (17.13). From
the specific form of these equations (and by the Bayesian network nature of such models, for the more
familiar reader) we can write


$$p(x_n|x_{1:n-1}, y_{1:n-1}) = p(x_n|x_{n-1}) \tag{17.28}$$

and

$$p(y_n|x_{1:n}, y_{1:n-1}) = p(y_n|x_n). \tag{17.29}$$

Our starting point is the sequential estimation of $p(x_{1:n}|y_{1:n})$; the estimation of $p(x_n|y_{1:n})$, which
comprises our main goal, will be obtained as a by-product. Note that [15]

$$\begin{aligned}
p(x_{1:n}, y_{1:n}) &= p(x_n, x_{1:n-1}, y_n, y_{1:n-1}) \\
&= p(x_n, y_n|x_{1:n-1}, y_{1:n-1})\, p(x_{1:n-1}, y_{1:n-1}) \\
&= p(y_n|x_n)\, p(x_n|x_{n-1})\, p(x_{1:n-1}, y_{1:n-1}),
\end{aligned} \tag{17.30}$$

where Eqs. (17.28) and (17.29) have been employed.

Our goal is to obtain an approximation, via the generation of particles, of the conditional PDF,

$$p(x_{1:n}|y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{\int p(x_{1:n}, y_{1:n})\, dx_{1:n}} = \frac{p(x_{1:n}, y_{1:n})}{Z_n}, \tag{17.31}$$

where

$$Z_n := \int p(x_{1:n}, y_{1:n})\, dx_{1:n}.$$


To put the current discussion in the general framework of SIS, compare Eq. (17.31) with Eq. (17.6),
which leads to the definition

$$\phi_n(x_{1:n}) := p(x_{1:n}, y_{1:n}). \tag{17.32}$$

Then Eq. (17.9) becomes

$$w_n(x_{1:n}) = w_{n-1}(x_{1:n-1})\, \alpha_n(x_{1:n}), \tag{17.33}$$

where now

$$\alpha_n(x_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(x_{1:n-1}, y_{1:n-1})\, q_n(x_n|x_{1:n-1}, y_{1:n})},$$

which from Eq. (17.30) becomes

$$\alpha_n(x_{1:n}) = \frac{p(y_n|x_n)\, p(x_n|x_{n-1})}{q_n(x_n|x_{1:n-1}, y_{1:n})}. \tag{17.34}$$


The final step is to select the proposal distribution. From Section 17.2, recall that the optimal proposal
distribution is given from Eq. (17.11), which for our case takes the form

$$q_n^{opt}(x_n|x_{1:n-1}, y_{1:n}) = p(x_n|x_{1:n-1}, y_{1:n}) = p(x_n|x_{n-1}, x_{1:n-2}, y_n, y_{1:n-1}),$$

and exploiting the underlying independencies, as they are imposed by the Bayesian network structure
of the state-space model, we finally get

$$q_n^{opt}(x_n|x_{1:n-1}, y_{1:n}) = p(x_n|x_{n-1}, y_n) : \text{optimal proposal distribution.} \tag{17.35}$$


The use of the optimal proposal distribution leads to the following weight update recursion
(Problem 17.6):

$$w_n(x_{1:n}) = w_{n-1}(x_{1:n-1})\, p(y_n|x_{n-1}) : \text{optimal weights.} \tag{17.36}$$

However, as is most often the case in practice, optimality is not always easy to obtain. Note that
Eq. (17.36) requires the following integration:

$$p(y_n|x_{n-1}) = \int p(y_n|x_n)\, p(x_n|x_{n-1})\, dx_n,$$

which may not be tractable. Moreover, even if the integral can be computed, sampling from $p(y_n|x_{n-1})$
directly may not be feasible. In any case, even if the optimal proposal distribution cannot be used, we
can still select the proposal distribution to be of the form

$$q_n(x_n|x_{1:n-1}, y_{1:n}) = q(x_n|x_{n-1}, y_n). \tag{17.37}$$










Note that such a choice is particularly convenient, because sampling at time $n$ only depends on $x_{n-1}$
and $y_n$, and not on the entire history. If, in addition, the goal is to obtain estimates of $p(x_n|y_{1:n})$, then
one need not keep in memory all previously generated samples, but only the most recent one, $x_n$.

We are now ready to write the first particle filtering algorithm. 

Algorithm 17.4 (SIS particle filtering).

• Select a prior distribution, $p$, to generate the initial state $x_0$.

• Select the number of particle streams, $N$.

• For $i = 1, 2, \ldots, N$, Do

  - Draw $x_0^{(i)} \sim p(x)$; Initialize the $N$ streams of particles.

  - Set $w_0^{(i)} = \frac{1}{N}$; Set all initial weights equal.

• End For

• For $n = 1, 2, \ldots$, Do

  - For $i = 1, 2, \ldots, N$, Do

    * Draw $x_n^{(i)} \sim q(x|x_{n-1}^{(i)}, y_n)$

    * $w_n^{(i)} = w_{n-1}^{(i)} \dfrac{p(x_n^{(i)}|x_{n-1}^{(i)})\, p(y_n|x_n^{(i)})}{q(x_n^{(i)}|x_{n-1}^{(i)}, y_n)}$; formulas (17.33), (17.34), and (17.37).

  - End For

  - For $i = 1, 2, \ldots, N$, Do

    * Compute the normalized weights $W_n^{(i)}$.

  - End For

• End For

Note that the generation of the N streams of particles can take place concurrently, by exploiting 
parallel processing capabilities, if they are available in the processor. 
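
A single time step of the SIS recursion, applied to all $N$ streams at once, might look as follows. This is a minimal sketch and not the author's implementation: the function `sis_step` and the callables `sample_q`, `log_q`, `log_trans`, and `log_lik` are hypothetical names for user-supplied model components, and the weights are handled in the log domain as a numerical precaution that the text does not prescribe.

```python
import numpy as np

def sis_step(x_prev, w_prev, y, sample_q, log_q, log_trans, log_lik):
    """One SIS time step for N scalar particle streams.

    x_prev : (N,) particles at time n-1
    w_prev : (N,) positive unnormalized weights at time n-1
    sample_q(x_prev, y)      -> draws from q(x | x_{n-1}, y_n)
    log_q(x, x_prev, y)      -> log q(x_n | x_{n-1}, y_n)
    log_trans(x, x_prev)     -> log p(x_n | x_{n-1})
    log_lik(y, x)            -> log p(y_n | x_n)
    """
    x_new = sample_q(x_prev, y)
    # Incremental weight: p(y_n|x_n) p(x_n|x_{n-1}) / q(x_n|x_{n-1}, y_n)
    log_alpha = log_trans(x_new, x_prev) + log_lik(y, x_new) - log_q(x_new, x_prev, y)
    log_w = np.log(w_prev) + log_alpha
    # Rescale by a common factor before exponentiating; this does not
    # change the normalized weights.
    w_new = np.exp(log_w - log_w.max())
    W_new = w_new / w_new.sum()
    return x_new, w_new, W_new
```

Because every operation is vectorized over the particle index, the streams are processed independently of each other within a step, which is what makes the concurrent generation mentioned above possible.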

The particles generated along the $i$th stream, $x_n^{(i)}$, $n = 1, 2, \ldots$, represent a path/trajectory through
the state space. Once the particles have been drawn and the normalized weights computed, we obtain
the estimate

$$\hat{p}(x_{1:n}|y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\, \delta(x_{1:n} - x_{1:n}^{(i)}).$$

If, as commented earlier, our interest lies in keeping the terminal sample, $x_n^{(i)}$, only, then discarding the
path history, $x_{1:n-1}^{(i)}$, we can write

$$\hat{p}(x_n|y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)}\, \delta(x_n - x_n^{(i)}).$$

Note that as the number of particles, $N$, tends to infinity, the previous approximations tend to the true
posterior densities. Fig. 17.3 provides a graphical interpretation of the SIS Algorithm 17.4.

FIGURE 17.3 

Three consecutive recursions, for the particle filtering scheme given in Algorithm 17.4, with N = 1 streams of parti- 
cles. The area of the circles corresponds to the size of the normalized weights of the respective particles drawn from 
the proposal distribution. 


Example 17.1. Consider the one-dimensional random walk model written as

$$x_n = x_{n-1} + \eta_n, \tag{17.38}$$

$$y_n = x_n + v_n, \tag{17.39}$$

where $\eta_n \sim \mathcal{N}(\eta_n|0, \sigma_\eta^2)$, $v_n \sim \mathcal{N}(v_n|0, \sigma_v^2)$, with $\sigma_\eta^2 = 1$, $\sigma_v^2 = 1$. Although this is a typical task for
(linear) Kalman filtering, we will attack it here via the particle filtering rationale in order to demonstrate
some of the previously reported performance-related issues. The proposal distribution is selected to be

$$q(x_n|x_{n-1}, y_n) = p(x_n|x_{n-1}) = \mathcal{N}(x_n|x_{n-1}, \sigma_\eta^2).$$

1. Generate $T = 100$ observations, $y_n$, $n = 1, 2, \ldots, T$, to be used by Algorithm 17.4. To this end, start
with an arbitrary state value, for example, $x_0 = 0$, and generate a realization of the random walk,
drawing samples from the Gaussians (we know how to generate Gaussian samples) $\mathcal{N}(\cdot|0, \sigma_\eta^2)$
and $\mathcal{N}(\cdot|0, \sigma_v^2)$, according to Eqs. (17.38) and (17.39). Fig. 17.4 shows a realization for the output
variable. Our goal is to use the sequence of the resulting observations to generate particles and
demonstrate the increase of the variance of the associated weights as time goes by.

FIGURE 17.4

The observation sequence for Example 17.1.

2. Use $\mathcal{N}(\cdot|0, 1)$ to initialize $N = 200$ particle streams, $x_0^{(i)}$, $i = 1, 2, \ldots, N$, and initialize the
normalized weights to equal values, $W_0^{(i)} = \frac{1}{N}$, $i = 1, 2, \ldots, N$. Fig. 17.5 provides the corresponding
plot.

3. Use Algorithm 17.4 and plot the resulting particles together with the respective weights at time
instants $n = 0$, $n = 1$, $n = 3$, and $n = 30$. Observe how the variance of the weights increases with
time. At time $n = 30$ only a few particles have nonzero weights.

4. Repeat the experiment with $N = 1000$. Fig. 17.6 is the counterpart of Fig. 17.5 for the snapshots of
$n = 3$ and $n = 30$. Observe that increasing the number of particles improves the performance with
respect to the variance of weights. This is one path to obtain more particles with significant weight
values. The other path is via resampling techniques.
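
The experiment of this example can be reproduced along the following lines. The sketch below is written in Python with NumPy as an assumption of convenience; the function names `simulate_random_walk` and `sis_prior_proposal`, the random seed, and the use of the largest normalized weight as a rough indicator of degeneracy are illustrative choices made here, not part of the example.

```python
import numpy as np

def simulate_random_walk(T, var_eta=1.0, var_v=1.0, x0=0.0, rng=None):
    """Generate states and observations of the random walk model."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0 + np.cumsum(rng.normal(0.0, np.sqrt(var_eta), T))  # x_n = x_{n-1} + eta_n
    y = x + rng.normal(0.0, np.sqrt(var_v), T)                # y_n = x_n + v_n
    return x, y

def sis_prior_proposal(y, N, var_eta=1.0, var_v=1.0, rng=None):
    """SIS with proposal q = p(x_n | x_{n-1}); no resampling.

    Returns an array of shape (T, N) with the normalized weights at each n.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 1.0, N)          # x_0^{(i)} ~ N(0, 1)
    log_w = np.full(N, -np.log(N))       # equal initial weights 1/N
    history = []
    for y_n in y:
        x = x + rng.normal(0.0, np.sqrt(var_eta), N)   # draw from the proposal
        # With this proposal the weight update reduces to multiplying by
        # p(y_n | x_n); additive constants in the log cancel on normalization.
        log_w = log_w - 0.5 * (y_n - x) ** 2 / var_v
        W = np.exp(log_w - log_w.max())
        history.append(W / W.sum())
    return np.array(history)

rng = np.random.default_rng(0)           # arbitrary seed
_, y = simulate_random_walk(100, rng=rng)
W = sis_prior_proposal(y, N=200, rng=rng)
for n in (1, 3, 30):
    print("n =", n, "largest normalized weight:", W[n - 1].max())
```

Plotting `W[n - 1]` against the particle positions for the listed time instants would give the counterpart of the weight plots discussed in the example.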

17.4.1 DEGENERACY 

Particle filtering is a special case of sequential importance sampling; hence, everything that has been
said in Section 17.2 concerning the respective performance also applies here.

A major problem is the degeneracy phenomenon. The variance of the importance weights increases
in time, and after a few iterations only very few (or even only one) of the particles are assigned
nonnegligible weights, and the discrete random measure degenerates quickly. There are two methods for
reducing degeneracy: one is selecting a good proposal distribution and the other is resampling.

We know the optimal choice for the proposal distribution is

$$q(\cdot|x_{n-1}^{(i)}, y_n) = p(\cdot|x_{n-1}^{(i)}, y_n).$$

There are cases where this is available in analytic form. For example, this happens if the noise sources
are Gaussian and the observation equation is linear (e.g., [11]). If analytic forms are not available and
direct sampling is not possible, approximations of $p(\cdot|x_{n-1}^{(i)}, y_n)$ are mobilized. Our familiar (from
Chapter 12) Gaussian approximation via local linearization of $\ln p(\cdot|x_{n-1}^{(i)}, y_n)$ is a possibility [11].
The use of suboptimal filtering techniques such as the extended/unscented Kalman filter has also
been advocated [37]. In general, it must be kept in mind that the choice of the proposal distribution
plays a crucial role in the performance of particle filtering. Resampling is the other path that has been
discussed in Section 17.2.2. The counterpart of Algorithm 17.1 can also be adopted for the case of
particle filtering. However, we are going to give a slightly modified version of it.

FIGURE 17.5

Plot of N = 200 generated particles with the corresponding (normalized) weights, for Example 17.1, at time instants
(A) n = 0, (B) n = 1, (C) n = 3, and (D) n = 30. Observe that as time goes by, the variance of the weights increases.
At time n = 30, only very few particles have a nonzero weight value.
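
As an illustration of a case where the optimal proposal is available analytically, consider the scalar linear Gaussian model $x_n = x_{n-1} + \eta_n$, $y_n = x_n + v_n$ with noise variances $\sigma_\eta^2$ and $\sigma_v^2$. Completing the square gives a Gaussian $p(x_n|x_{n-1}, y_n)$ and a Gaussian $p(y_n|x_{n-1}) = \mathcal{N}(y_n|x_{n-1}, \sigma_\eta^2 + \sigma_v^2)$. The sketch below is an assumption-laden example, not taken from the text; the function name `optimal_proposal_step` is hypothetical.

```python
import numpy as np

def optimal_proposal_step(x_prev, w_prev, y, var_eta, var_v, rng):
    """One step with the optimal proposal for x_n = x_{n-1} + eta_n, y_n = x_n + v_n."""
    # p(x_n | x_{n-1}, y_n) is proportional to
    # N(x_n | x_{n-1}, var_eta) * N(y_n | x_n, var_v), hence Gaussian with:
    var_post = 1.0 / (1.0 / var_eta + 1.0 / var_v)
    mean_post = var_post * (x_prev / var_eta + y / var_v)
    x_new = mean_post + np.sqrt(var_post) * rng.standard_normal(x_prev.shape)
    # Weight update with p(y_n | x_{n-1}) = N(y_n | x_{n-1}, var_eta + var_v);
    # note that it does not depend on the newly drawn x_n.
    s = var_eta + var_v
    incr = np.exp(-0.5 * (y - x_prev) ** 2 / s) / np.sqrt(2.0 * np.pi * s)
    return x_new, w_prev * incr

rng = np.random.default_rng(0)
x_new, w_new = optimal_proposal_step(np.array([0.0, 1.0]), np.ones(2), 1.0, 1.0, 1.0, rng)
```

Since the weight increment depends only on $x_{n-1}$, the weights could be computed (and resampling decided) before the new particles are drawn.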

17.4.2 GENERIC PARTICLE FILTERING 

Resampling has a number of advantages. It discards, with high probability, particles of low weights;
that is, only particles corresponding to regions of high-probability mass are propagated. Of course,
resampling has its own limitations. For example, a particle of low weight at time $n$ will not necessarily
have a low weight at later time instants. In such a case, resampling is rather wasteful. Moreover,
resampling limits the potential of parallelizing the computational process, because particles along the
different streams have to be "combined" at each time instant. However, some efforts for enhancing
parallelism have been reported (see, e.g., [21]). Also, particles corresponding to high values of weights
are drawn many times and lead to a set of samples of low diversity; this phenomenon is also known as
sample impoverishment. The effects of this phenomenon become more severe in cases of low state/process
noise, $\eta_n$, in Eq. (17.12), where the set of the sampling points may end up comprising a single
point (e.g., [1]).

FIGURE 17.6

Plot of N = 1000 generated particles with the corresponding (normalized) weights, for Example 17.1, at time
instants (A) n = 3 and (B) n = 30. As expected, compared to Fig. 17.5, more particles with significant weights
survive.

Hence, avoiding resampling can be beneficial. In practice, resampling is performed only if a related
metric of the variance of the weights is below a threshold. In [28,29], the effective number of samples
is approximated by

$$N_{eff} \approx \frac{1}{\sum_{i=1}^{N} \left(W_n^{(i)}\right)^2}. \tag{17.40}$$

The value of this index ranges from 1 to $N$. Resampling is performed if $N_{eff} < N_T$, typically with
$N_T = N/2$.
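
A common form of this approximation is $N_{eff} \approx 1/\sum_{i}(W_n^{(i)})^2$, which equals $N$ for uniform weights and 1 when a single particle carries all the weight. A minimal sketch of the computation follows; the function name `effective_sample_size` is an assumption made here.

```python
import numpy as np

def effective_sample_size(W):
    """Approximate effective number of samples from the weights.

    The weights are renormalized first, so unnormalized input is accepted.
    """
    W = np.asarray(W, dtype=float)
    W = W / W.sum()
    return 1.0 / np.sum(W ** 2)

n_uniform = effective_sample_size([0.25, 0.25, 0.25, 0.25])   # all weights equal
n_degenerate = effective_sample_size([1.0, 0.0, 0.0])          # one dominant particle
```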



Algorithm 17.5 (Generic particle filtering).

• Select a prior distribution, $p$, to generate particles for the initial state $x_0$.

• Select the number of particle streams, $N$.

• For $i = 1, 2, \ldots, N$, Do

  - Draw $x_0^{(i)} \sim p(x)$; Initialize $N$ streams.

  - Set $w_0^{(i)} = \frac{1}{N}$; All initial normalized weights are equal.

• End For

• For $n = 1, 2, 3, \ldots$, Do

  - For $i = 1, 2, \ldots, N$, Do

    * Draw $x_n^{(i)} \sim q(x|x_{n-1}^{(i)}, y_n)$

    * $w_n^{(i)} = w_{n-1}^{(i)} \dfrac{p(x_n^{(i)}|x_{n-1}^{(i)})\, p(y_n|x_n^{(i)})}{q(x_n^{(i)}|x_{n-1}^{(i)}, y_n)}$

  - End For

  - For $i = 1, 2, \ldots, N$, Do

    * Compute the normalized $W_n^{(i)}$.

  - End For

  - Compute $N_{eff}$; Eq. (17.40).

  - If $N_{eff} < N_T$; preselected value $N_T$.

    * Resample $\{x_n^{(i)}, W_n^{(i)}\}_{i=1}^{N}$ to obtain $\{\bar{x}_n^{(i)}, W_n^{(i)} = \frac{1}{N}\}_{i=1}^{N}$

    * $x_n^{(i)} = \bar{x}_n^{(i)}$, $w_n^{(i)} = \frac{1}{N}$

  - End If

• End For

Fig. 17.7 presents a graphical illustration of the time evolution of the algorithm. 
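
The generic scheme, with resampling triggered by the effective number of samples, could be sketched as below. This is one possible reading of the algorithm under stated assumptions, not a definitive implementation: the function `generic_particle_filter` and its callable arguments are hypothetical names, multinomial resampling via `Generator.choice` is assumed for the resampling step, $N_{eff}$ is computed as $1/\sum_i (W_n^{(i)})^2$, and the usage example applies it to a scalar random walk with unit noise variances and the prior as proposal.

```python
import numpy as np

def generic_particle_filter(y, N, N_T, sample_prior, sample_q, log_weight_incr, rng):
    """Particle filter with resampling when N_eff < N_T.

    sample_prior(N, rng)               -> (N,) initial particles
    sample_q(x_prev, y_n, rng)         -> (N,) draws from the proposal
    log_weight_incr(x_new, x_prev, y)  -> log of p(x_n|x_{n-1}) p(y_n|x_n) / q(...)
    Returns the weighted-mean state estimates and the number of resampling steps.
    """
    x = sample_prior(N, rng)
    W = np.full(N, 1.0 / N)
    estimates, n_resampled = [], 0
    for y_n in y:
        x_new = sample_q(x, y_n, rng)
        with np.errstate(divide="ignore"):      # tolerate log(0) for dead particles
            log_w = np.log(W) + log_weight_incr(x_new, x, y_n)
        W = np.exp(log_w - log_w.max())
        W = W / W.sum()                          # normalized weights
        x = x_new
        estimates.append(np.sum(W * x))
        if 1.0 / np.sum(W ** 2) < N_T:           # effective number of samples
            idx = rng.choice(N, size=N, p=W)     # multinomial resampling (assumed)
            x, W = x[idx], np.full(N, 1.0 / N)
            n_resampled += 1
    return np.array(estimates), n_resampled

# Usage on a scalar random walk with unit noise variances (illustrative values)
rng = np.random.default_rng(1)
x_true = np.cumsum(rng.normal(size=100))
y = x_true + rng.normal(size=100)
est, n_resampled = generic_particle_filter(
    y, N=200, N_T=100,
    sample_prior=lambda N, rng: rng.normal(0.0, 1.0, N),
    sample_q=lambda x, y_n, rng: x + rng.normal(0.0, 1.0, x.shape),
    # with the prior as proposal, the increment reduces to log p(y_n | x_n)
    log_weight_incr=lambda x_new, x_old, y_n: -0.5 * (y_n - x_new) ** 2,
    rng=rng,
)
rmse = np.sqrt(np.mean((est - x_true) ** 2))
```

Other resampling schemes (for example, systematic or stratified) could be substituted for the multinomial draw without changing the rest of the loop.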

Remarks 17.2. 

• A popular choice for the proposal distribution is the prior

$$q(x|x_{n-1}^{(i)}, y_n) = p(x|x_{n-1}^{(i)}),$$

which yields the following weights' update recursion,

$$w_n^{(i)} = w_{n-1}^{(i)}\, p(y_n|x_n^{(i)}).$$

The resulting algorithm is known as sampling-importance-resampling (SIR). The great advantage 
of such a choice is its simplicity. However, the generation mechanism of particles ignores important 
information that resides in the observation sequence; the proposal distribution is independent of the 
observations. This may lead to poor results. A remedy can be offered by the use of auxiliary particle 
filtering, to be reviewed next. Another possibility is discussed in [22], via a combination of the prior 
and the optimal proposal distributions. 

Example 17.2. Repeat Example 17.1, using $N = 200$ particles, for Algorithm 17.5. Use the threshold
value $N_T = 100$. Observe in Fig. 17.8 that, for the corresponding time instants, more particles with
significant weights are generated compared to Fig. 17.5.

FIGURE 17.7

Three successive time iterations for N = 7 streams of particles corresponding to Algorithm 17.5. At steps n and 
n + 2 resampling is performed. At step n + 1 no resampling is needed. 


17.4.3 AUXILIARY PARTICLE FILTERING 

Auxiliary particle filters were introduced in [34] in order to improve performance when dealing with
heavy-tailed distributions. The method introduces an auxiliary variable; this is the index of a particle at
the previous time instant. We allow for a particle in the $i$th stream at time $n$ to be drawn using a particle
from a different stream at time $n - 1$. Let the $i$th particle at time $n$ be $x_n^{(i)}$ and let the index of its
"parent" particle at time $n - 1$ be $i_{n-1}$. The idea is to sample for the pair $(x_n^{(i)}, i_{n-1})$, $i = 1, 2, \ldots, N$.
Employing the Bayes rule, we obtain

$$p(x_n, i|y_{1:n}) \propto p(y_n|x_n)\, p(x_n, i|y_{1:n-1}) = p(y_n|x_n)\, p(x_n|i, y_{1:n-1})\, P(i|y_{1:n-1}), \tag{17.41}$$

where the conditional independencies underlying the state-space model have been used. To unclutter
notation, we have used $x_n$ in place of $x_n^{(i)}$, and the subscript $n - 1$ has been omitted from $i_{n-1}$ and we
use $i$ instead. Note that by the definition of the index $i_{n-1}$, we have

$$p(x_n|i, y_{1:n-1}) = p(x_n|x_{n-1}^{(i)}, y_{1:n-1}) = p(x_n|x_{n-1}^{(i)}), \tag{17.42}$$

and also

$$P(i|y_{1:n-1}) = W_{n-1}^{(i)}. \tag{17.43}$$

Thus, we can write

$$p(x_n, i|y_{1:n}) \propto p(y_n|x_n)\, p(x_n|x_{n-1}^{(i)})\, W_{n-1}^{(i)}. \tag{17.44}$$

The proposal distribution is chosen as

$$q(x_n, i|y_{1:n}) \propto p(y_n|\mu_n^{(i)})\, p(x_n|x_{n-1}^{(i)})\, W_{n-1}^{(i)}. \tag{17.45}$$

Note that we have used $\mu_n^{(i)}$ in place of $x_n$ in $p(y_n|x_n)$, because $x_n$ is still to be drawn. The estimate
$\mu_n^{(i)}$ is chosen in order to be easily computed and at the same time to be a good representative of
$x_n$. Typically, $\mu_n^{(i)}$ can be the mean, the mode, a draw, or another value associated with the distribution
$p(x_n|x_{n-1}^{(i)})$; for example, $\mu_n^{(i)} \sim p(x_n|x_{n-1}^{(i)})$. Also, if the state equation is $x_n = f(x_{n-1}) + \eta_n$, a
good choice would be $\mu_n^{(i)} = f(x_{n-1}^{(i)})$.

FIGURE 17.8

Plot of N = 200 generated particles with the corresponding (normalized) weights, for Example 17.2 using
resampling, at time instants (A) n = 3 and (B) n = 30. Compared to Figs. 17.5C and D, more particles with significant
weights survive.

Applying the Bayes rule in Eq. (17.45) and adopting

$$q(x_n|i, y_{1:n}) = p(x_n|x_{n-1}^{(i)}),$$

we obtain

$$q(i|y_{1:n}) \propto p(y_n|\mu_n^{(i)})\, W_{n-1}^{(i)}. \tag{17.46}$$

Hence, we draw the value of the index from a multinomial distribution, i.e.,

$$i_{n-1} \sim q(i|y_{1:n}) \propto p(y_n|\mu_n^{(i)})\, W_{n-1}^{(i)}, \quad i = 1, 2, \ldots, N. \tag{17.47}$$

The index $i_{n-1}$ identifies the distribution from which $x_n^{(i)}$ will be drawn, i.e.,

$$x_n^{(i)} \sim p(x_n|x_{n-1}^{(i_{n-1})}), \quad i = 1, 2, \ldots, N. \tag{17.48}$$

Note that Eq. (17.47) actually performs a resampling. However, now, the resampling at time $n - 1$
takes into consideration information that becomes available at time $n$, via the observation $y_n$. This
information is exploited in order to determine which particles are to survive, after resampling at a
given time instant, so that their "offsprings" are likely to land in regions of high-probability mass.
Once sample $x_n^{(i)}$ has been drawn, the index $i_{n-1}$ is discarded, which is equivalent to marginalizing
$p(x_n, i|y_{1:n})$ to obtain $p(x_n|y_{1:n})$. Each sample $x_n^{(i)}$ is finally assigned a weight according to

$$w_n^{(i)} \propto \frac{p(x_n^{(i)}, i_{n-1}|y_{1:n})}{q(x_n^{(i)}, i_{n-1}|y_{1:n})} = \frac{p(y_n|x_n^{(i)})}{p(y_n|\mu_n^{(i_{n-1})})},$$

which results by dividing the right-hand sides of Eqs. (17.44) and (17.45). Note that the weight accounts
for the mismatch between the likelihood $p(y_n|\cdot)$ at the actual sample and at the predicted point,
$\mu_n^{(i_{n-1})}$.

The resulting algorithm is summarized next.

Algorithm 17.6 (Auxiliary particle filtering).

• Initialization: Select a prior distribution, $p$, to generate the initial state $x_0$.

• Select $N$.

• For $i = 1, 2, \ldots, N$, Do

  - Draw $x_0^{(i)} \sim p(x)$; Initialize $N$ streams of particles.

  - Set $W_0^{(i)} = \frac{1}{N}$; Set all normalized weights to equal values.

• End For

• For $n = 1, 2, \ldots$, Do

  - For $i = 1, 2, \ldots, N$, Do

    * Draw/compute $\mu_n^{(i)}$

    * $Q_i = p(y_n|\mu_n^{(i)})\, W_{n-1}^{(i)}$; This corresponds to $q(i|y_{1:n})$ in Eq. (17.46).

  - End For

  - For $i = 1, 2, \ldots, N$, Do

    * Compute normalized $Q_i$

  - End For

  - For $i = 1, 2, \ldots, N$, Do

    * $i_{n-1} \sim Q_i$; Eq. (17.47).

    * Draw $x_n^{(i)} \sim p(x|x_{n-1}^{(i_{n-1})})$

    * Compute $w_n^{(i)} = \dfrac{p(y_n|x_n^{(i)})}{p(y_n|\mu_n^{(i_{n-1})})}$

  - End For

  - For $i = 1, 2, \ldots, N$, Do

    * Compute normalized $W_n^{(i)}$

  - End For

• End For

Fig. 17.9 shows N = 200 particles and their respective normalized weights, generated by
Algorithm 17.6 for the observation sequence of Example 17.1 and using the same proposal distribution.
Observe that compared to the corresponding Figs. 17.5 and 17.8, a substantially larger number of
particles with significant weights survive.
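
The single-stage auxiliary particle filter can be sketched as follows. This is an illustrative sketch under assumptions, not the author's code: the function `auxiliary_particle_filter` and the callables `propagate_mean`, `sample_trans`, and `log_lik` are hypothetical names, the predicted point is taken as $\mu_n^{(i)} = f(x_{n-1}^{(i)})$, and the usage example assumes a scalar random walk with unit noise variances.

```python
import numpy as np

def auxiliary_particle_filter(y, N, sample_prior, propagate_mean, sample_trans, log_lik, rng):
    """Single-stage auxiliary particle filter for scalar states.

    propagate_mean(x_prev)     -> mu_n^{(i)}, a representative of p(x_n | x_{n-1}^{(i)})
    sample_trans(x_prev, rng)  -> draws from p(x_n | x_{n-1})
    log_lik(y_n, x)            -> log p(y_n | x) up to an additive constant
    """
    x = sample_prior(N, rng)
    W = np.full(N, 1.0 / N)
    estimates = []
    for y_n in y:
        mu = propagate_mean(x)
        # First-stage weights Q_i proportional to p(y_n | mu_i) W_{n-1}^{(i)}
        with np.errstate(divide="ignore"):
            log_Q = log_lik(y_n, mu) + np.log(W)
        Q = np.exp(log_Q - log_Q.max())
        Q = Q / Q.sum()
        idx = rng.choice(N, size=N, p=Q)       # draw the parent indices i_{n-1}
        x = sample_trans(x[idx], rng)          # draw x_n from the selected parents
        # Second-stage weights: p(y_n | x_n) / p(y_n | mu of the parent)
        log_w = log_lik(y_n, x) - log_lik(y_n, mu[idx])
        W = np.exp(log_w - log_w.max())
        W = W / W.sum()
        estimates.append(np.sum(W * x))
    return np.array(estimates)

# Usage on a scalar random walk with unit noise variances (illustrative values)
rng = np.random.default_rng(2)
x_true = np.cumsum(rng.normal(size=100))
y = x_true + rng.normal(size=100)
est = auxiliary_particle_filter(
    y, N=200,
    sample_prior=lambda N, rng: rng.normal(0.0, 1.0, N),
    propagate_mean=lambda x: x,                      # f(x) = x for the random walk
    sample_trans=lambda x, rng: x + rng.normal(0.0, 1.0, x.shape),
    log_lik=lambda y_n, x: -0.5 * (y_n - x) ** 2,
    rng=rng,
)
rmse = np.sqrt(np.mean((est - x_true) ** 2))
```

The likelihood constant cancels both in the normalization of the first-stage weights and in the ratio of the second-stage weights, which is why `log_lik` may omit it.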

The previous algorithm is sometimes called the single-stage auxiliary particle filter, as opposed to
the two-stage one, which was originally proposed in [34]. The latter involved an extra resampling step
to obtain samples with equal weights. It has been experimentally verified that the single-stage version
leads to enhanced performance, and it is the one that is more widely used. It has been reported that
the auxiliary particle filter may lead to enhanced performance compared to Algorithm 17.5, for high
signal-to-noise ratios. However, for high-noise terrains its performance degrades (see, e.g., [1]). More
results concerning the performance and analysis of the auxiliary filter can be found in, for example,
[13,23,35].

FIGURE 17.9

Plot of N = 200 generated particles with the corresponding (normalized) weights, for the same observation
sequence as that in Example 17.1, using the auxiliary particle filtering algorithm, at time instants (A) n = 3 and
(B) n = 30. Compared to Figs. 17.5C and D and Fig. 17.8, more particles with significant weights survive.

Remarks 17.3.

• Besides the algorithms presented earlier, a number of variants have been proposed over the years
in order to overcome the main limitations of particle filters, associated with the increasing variance
and the sample impoverishment problem. In resample-move [17] and block sampling [14], instead
of just sampling for $x_n^{(i)}$ at time instant $n$, one also tries to modify past values, over a window
$[n - L + 1, n - 1]$ of fixed size $L$, in light of the newly arrived observation $y_n$. In the regularized
particle filter [32], in the resampling stage of Algorithm 17.5, instead of sampling from a discrete
distribution, samples are drawn from a smooth approximation,

$$\hat{p}(x_n|y_{1:n}) = \sum_{i=1}^{N} W_n^{(i)} K(x_n - x_n^{(i)}),$$

where $K$ is a smooth kernel density function. In [26,27], the posteriors are approximated by
Gaussians; as opposed to the more classical extended Kalman filters, the updating and filtering
is accomplished via the propagation of particles.

The interested reader may find more information concerning particle filtering in the tutorial papers
[1,9,15].

• Rao-Blackwellization is a technique used to reduce the variance of estimates that are obtained via
Monte Carlo sampling methods (e.g., [4]). To this end, this technique has also been employed
in particle filtering of dynamic systems. It turns out that often in practice, some of the states are
conditionally linear given the nonlinear ones. The main idea consists of treating the linear states
differently by viewing them as nuisance parameters and marginalizing them out of the estimation
process. The particles of the nonlinear states are propagated randomly, and then the task is treated
linearly via the use of a Kalman filter (see, e.g., [10,12,24]).

• Smoothing is closely related to filtering. In filtering, the goal lies in obtaining estimates
of $x_n$ based on observations taken in the interval $[1, n]$, that is, on $y_{1:n}$. In smoothing, one
obtains estimates of $x_n$ based on an observation set $y_{1:n+k}$, $k > 0$. There are two paths to smoothing.
One is known as fixed lag smoothing, where $k$ is a fixed lag. The other is known as fixed interval,
where one is interested in obtaining estimates based on observations taken over an interval $[1, T]$,
that is, based on a fixed set of measurements $y_{1:T}$.

There are different algorithmic approaches to smoothing. The naive one is to run the particle filtering
up to time $n + k$ or $T$ and use the obtained weights for weighting the particles at time $n$, in order to form
the random measure, i.e.,

$$\hat{p}(x_n|y_{1:n+k}) = \sum_{i=1}^{N} W_{n+k}^{(i)}\, \delta(x_n - x_n^{(i)}).$$

This can be a reasonable approximation for small values of $k$ (or $T - n$). Other, more refined,
techniques adopt a two-pass rationale. First, a particle filtering is run, and then a backward set of
recursions is used to modify the weights (see, e.g., [10]).

• A summary concerning convergence results related to particle filtering can be found in, for example,
[6].

• A survey on applications of particle filtering in signal processing-related tasks is given in [8,9]. 

• Following the general trend for developing algorithms for distributed learning, a major research 
effort has been dedicated in this direction in the context of particle filtering. For a review on such 
schemes, see, for example, [20]. 

• One of the main difficulties of the particle filtering methods is that the number of particles required 
to approximate the underlying distributions increases exponentially with the state dimension. To 
overcome this problem, several methods have been proposed. In [30], the authors propose to par- 
tition the state and estimate each partition independently. In [16], the annealed particle filter is 
proposed, which implements a coarse-to-fine strategy by using a series of smoothed weighting func- 
tions. The unscented particle filter [37] proposes to use the unscented transform for each particle to 
avoid wasting resources in low likelihood regions. In [31], a hierarchical search strategy is proposed 
that uses auxiliary lower dimension models to guide the search in the higher-dimensional one. 
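
The resampling stage of the regularized particle filter mentioned in the remarks above, where samples are drawn from the kernel-smoothed mixture $\sum_i W_n^{(i)} K(x_n - x_n^{(i)})$ rather than from the discrete measure, can be sketched as below. The Gaussian kernel, the bandwidth parameter `h`, and the function name `regularized_resample` are assumptions made for illustration; the text does not specify a kernel or a bandwidth rule.

```python
import numpy as np

def regularized_resample(x, W, h, rng):
    """Draw N samples from sum_i W_i K(x - x_i), with K a Gaussian of std h.

    Sampling from the mixture amounts to picking a kernel center with
    probability W_i and then adding a draw of kernel noise to it.
    """
    N = len(x)
    idx = rng.choice(N, size=N, p=W)              # select mixture components
    return x[idx] + h * rng.standard_normal(N)    # jitter around the selected center

rng = np.random.default_rng(3)
particles = np.array([0.0, 5.0, 10.0])
weights = np.array([0.0, 1.0, 0.0])
smooth = regularized_resample(particles, weights, h=0.1, rng=rng)
plain = regularized_resample(particles, weights, h=0.0, rng=rng)   # h = 0: ordinary resampling
```

With `h = 0` the scheme reduces to ordinary multinomial resampling; a positive `h` spreads the copies of a heavily weighted particle, which is how the method counteracts sample impoverishment.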

Example 17.3. Stochastic volatility model. Consider the following state-space model for generating
the observations:

$$x_n = \alpha x_{n-1} + \eta_n,$$
$$y_n = \beta v_n \exp\left(\frac{x_n}{2}\right).$$

This model belongs to a more general class known as stochastic volatility models, where the variance
of a process is itself randomly distributed. Such models are used in financial mathematics to model
derivative securities, such as options. The state variable is known as the log-volatility. We assume the
two noise sequences to be i.i.d. and mutually independent Gaussians with zero mean and variances $\sigma_\eta^2$
and $\sigma_v^2$, respectively. The model parameters $\alpha$ and $\beta$ are known as the persistence in volatility shocks
and modal volatility, respectively. The adopted values for the parameters are $\sigma_\eta^2 = 0.178$, $\sigma_v^2 = 1$,
$\alpha = 0.97$, and $\beta = 0.69$.

The goal of the example is to generate a sequence of observations and then, based on these
measurements, to predict the state, which is assumed to be unknown. To this end, we generate a sequence
of $N = 2000$ particles, and the state variable at each time instant is estimated as the weighted average
of the generated particles, that is,

$$\hat{x}_n = \sum_{i=1}^{N} W_n^{(i)} x_n^{(i)}.$$

Both the SIR Algorithm 17.5 and the auxiliary filter method of Algorithm 17.6 were used. The proposal
distribution was

$$q(x_n|x_{n-1}) = \mathcal{N}(x_n|\alpha x_{n-1}, \sigma_\eta^2).$$

Fig. 17.10 shows the observation sequence together with the obtained estimate. For comparison
reasons, the corresponding true state value is also shown. Both methods for generating particles gave
almost identical results, and we only show one of them. Observe how closely the estimates follow the
true values.

FIGURE 17.10


The observation sequence generated by the volatility model together with the true and estimated values of the state 
variable. 
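
A sketch of the stochastic volatility experiment with an SIR filter is given below. It assumes the model $x_n = \alpha x_{n-1} + \eta_n$, $y_n = \beta v_n \exp(x_n/2)$ with $\alpha = 0.97$, $\beta = 0.69$, $\sigma_\eta^2 = 0.178$, and $\sigma_v^2 = 1$. Several details are assumptions made here and are not specified in the example: the sequence length, the seed, the stationary prior used to initialize the particles, and resampling at every step instead of using a threshold. The function names are hypothetical.

```python
import numpy as np

ALPHA, BETA, VAR_ETA, VAR_V = 0.97, 0.69, 0.178, 1.0

def simulate_sv(T, rng):
    """Generate log-volatility states and observations, starting from x_0 = 0."""
    x = np.empty(T)
    prev = 0.0
    for n in range(T):
        prev = ALPHA * prev + rng.normal(0.0, np.sqrt(VAR_ETA))
        x[n] = prev
    y = BETA * rng.normal(0.0, np.sqrt(VAR_V), T) * np.exp(x / 2.0)
    return x, y

def log_lik(y_n, x):
    """log p(y_n | x_n): y_n is zero-mean Gaussian with variance beta^2 var_v exp(x_n)."""
    var = BETA ** 2 * VAR_V * np.exp(x)
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * y_n ** 2 / var

def sir_filter(y, N, rng):
    """SIR filter with the prior transition as proposal; returns weighted-mean estimates."""
    # Initial particles from the stationary distribution of the state (an assumption)
    x = rng.normal(0.0, np.sqrt(VAR_ETA / (1.0 - ALPHA ** 2)), N)
    est = np.empty(len(y))
    for n, y_n in enumerate(y):
        x = ALPHA * x + rng.normal(0.0, np.sqrt(VAR_ETA), N)   # propagate
        log_w = log_lik(y_n, x)
        W = np.exp(log_w - log_w.max())
        W = W / W.sum()
        est[n] = np.sum(W * x)                                 # weighted average
        x = x[rng.choice(N, size=N, p=W)]                      # resample every step
    return est

rng = np.random.default_rng(4)
x_true, y = simulate_sv(500, rng)
x_hat = sir_filter(y, N=2000, rng=rng)
```

Plotting `y`, `x_true`, and `x_hat` against time would give a figure comparable in spirit to Fig. 17.10.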


Example 17.4. Visual tracking. Consider the problem of visual tracking of a circle, which has a
constant and known radius. We seek to track its position, that is, the coordinates of its center, $x = [x_1, x_2]^T$.
This vector will comprise the state variable. The model for generating the observations is given by

$$x_n = x_{n-1} + \eta_n,$$
$$y_n = x_n + v_n, \tag{17.49}$$

where $\eta_n$ is a uniform noise in the interval $[-10, 10]$ pixels, for each dimension. Note that due to the
uniform nature of the noise, Kalman filtering, in its standard formulation, is no longer the optimal
choice, in spite of the linearity of the model. The noise $v_n$ follows a Gaussian PDF $\mathcal{N}(0, \Sigma_v)$, with
covariance matrix $\Sigma_v$.

Initially, the target circle is located in the image center. The particle filter employs $N = 50$ particles
and the SIS sampling method was used (see, also, MATLAB® Exercise 17.12).

Fig. 17.11 shows the circle and the generated particles, which attempt to track the center of the
circle from the noisy observations, for different time instants. Observe how closely the particles track
the center of the circle as it moves around. A related video is available from the companion site of this
book.


FIGURE 17.11

The circle (in gray) and the generated particles (in red) for time instants n = 1, n = 30, n = 60, and n = 120.


PROBLEMS


17.1 Let

$$\mu := \mathbb{E}[f(x)] = \int f(x)\, p(x)\, dx,$$

and let $q(x)$ be the proposal distribution. Show that if

$$w(x) := \frac{p(x)}{q(x)}$$

and

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} w(x_i) f(x_i),$$

then the variance

$$\sigma^2 = \mathbb{E}\left[(\hat{\mu} - \mu)^2\right] = \frac{1}{N}\left(\int \frac{f^2(x)\, p^2(x)}{q(x)}\, dx - \mu^2\right).$$

Observe that if $f^2(x) p^2(x)$ goes to zero slower than $q(x)$, then for fixed $N$, $\sigma^2 \to \infty$.

17.2 In importance sampling, with weights defined as

$$w(x) = \frac{\phi(x)}{q(x)},$$

where

$$p(x) = \frac{1}{Z}\phi(x),$$

we know from Problem 14.6 that the estimate

$$\hat{Z} = \frac{1}{N}\sum_{i=1}^{N} w(x_i)$$

defines an unbiased estimator of the normalizing constant, $Z$. Show that the respective variance
is given by

$$\mathrm{var}[\hat{Z}] = \frac{Z^2}{N}\left(\int \frac{p^2(x)}{q(x)}\, dx - 1\right).$$
17.3 Show that using resampling in importance sampling, as the number of particles tends to infinity,
the distribution $\hat{p}$, approximated by the respective discrete random measure, tends to the true
(desired) one, $p$.

Hint: Consider the one-dimensional case.

17.4 Show that in sequential importance sampling, the proposal distribution that minimizes the
variance of the weight at time $n$, conditioned on $x_{1:n-1}$, is given by

$$q_n^{opt}(x_n|x_{1:n-1}) = p_n(x_n|x_{1:n-1}).$$


17.5 In a sequential importance sampling task, let

$$p_n(x_{1:n}) = \prod_{k=1}^{n} \mathcal{N}(x_k|0, 1),$$

and let the proposal distribution be

$$q_n(x_{1:n}) = \prod_{k=1}^{n} \mathcal{N}(x_k|0, \sigma^2).$$

Let the estimate of $Z_n = (2\pi)^{n/2}$ be

$$\hat{Z}_n = \frac{1}{N}\sum_{i=1}^{N} \frac{\phi_n(x_{1:n}^{(i)})}{q_n(x_{1:n}^{(i)})}, \quad \phi_n(x_{1:n}) := Z_n\, p_n(x_{1:n}).$$

Show that the variance of the estimator is given by

$$\mathrm{var}[\hat{Z}_n] = \frac{Z_n^2}{N}\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{n/2} - 1\right].$$

Observe that for $\sigma^2 > 1/2$, which is the range of values for the above formula to make sense
and guarantees a finite value for the variance, the variance exhibits an exponential increase with
respect to $n$. To keep the variance small, one has to make $N$ very large, that is, generate a very
large number of particles [15].

17.6 Prove that the use of the optimal proposal distribution in particle filtering leads to

$$w_n(x_{1:n}) = w_{n-1}(x_{1:n-1})\, p(y_n|x_{n-1}).$$


MATLAB® EXERCISES 


17.7 For the state-space model of Example 17.1, implement the generic particle filtering algorithm
for different numbers of particle streams $N$ and different thresholds of effective particle sizes
$N_{eff}$.

Hint: Start by selecting a distribution (the normal should be a good start) and initialize. Then
update the particles in each step according to the algorithm. Finally, check whether $N_{eff}$ is
lower than the threshold, and if it is, continue with the resampling process.

17.8 For the same example as before, implement the SIS particle filtering algorithm and plot the
resulting particles together with the normalized weights for various time instances $n$. Observe
the degeneracy phenomenon of the weights as time evolves.

17.9 For Example 17.1, implement the SIR particle filtering algorithm for different numbers of
particle streams $N$ and for various time instances $n$. Use $N_T = N/2$. Compare the performance of
SIR and SIS algorithms.

17.10 Repeat the previous exercise, implement the auxiliary particle filtering (APF) algorithm, and
compare the particle-weight histogram with the ones obtained from SIS and SIR algorithms.
Observe that the number of particles with significant weights that survive is substantially larger.

17.11 Reproduce Fig. 17.10 for the stochastic volatility model of Example 17.3 and observe how the
estimated sequence $\hat{x}_n$ follows the true sequence $x_n$ based on the observations $y_n$.

17.12 Develop the MATLAB® code to reproduce the visual tracking of the circle of Example 17.4.
Because, at each time instant, we are only interested in $x_n$ and not in the whole sequence,
modify the SIS sampling in Algorithm 17.4 to care for this case.

Specifically, given Eqs. (17.28) and (17.29), in order to estimate $x_n$ instead of $x_{1:n}$,
Eq. (17.31) is simplified to

$$p(x_n|y_{1:n}) = \frac{p(y_n|x_n)\, p(x_n|y_{1:n-1})}{p(y_n|y_{1:n-1})}, \tag{17.50}$$

where

$$p(x_n|y_{1:n-1}) = \int p(x_n|x_{n-1})\, p(x_{n-1}|y_{1:n-1})\, dx_{n-1}. \tag{17.51}$$

The samples are now weighted as

$$w_n^{(i)} \propto \frac{p(x_n^{(i)}|y_{1:n})}{q(x_n^{(i)}|y_{1:n})}, \tag{17.52}$$

and a popular selection for the proposal distribution is

$$q(x_n|y_{1:n}) = p(x_n|y_{1:n-1}). \tag{17.53}$$

Substituting Eqs. (17.53) and (17.50) into Eq. (17.52), we get the following rule for the weights:

$$w_n^{(i)} \propto p(y_n|x_n^{(i)}). \tag{17.54}$$


18.1 INTRODUCTION

Neural networks have a long history that goes back to the first attempts to understand how the human 
(and more generally, the mammal) brain works and how what we call intelligence is formed. 

From a physiological point of view, one can trace the beginning of the field back to the work of
Santiago Ramón y Cajal [187], who discovered that the basic building element of the brain is the
neuron. The brain comprises approximately 60 to 100 billion neurons; that is, a number of the same
order as the number of stars in our galaxy! Each neuron is connected with other neurons via elementary
structural and functional units/links, known as synapses. It is estimated that there are 50 to 100 trillion
synapses. These links mediate information between connected neurons. The most common synapses
are the chemical ones, which convert electric pulses, produced by a neuron, to a chemical
signal and then back to an electrical one. Via these links, each neuron is connected to other neurons
and this happens in a hierarchically structured way, in a layer-wise fashion.

Santiago Ramón y Cajal (1852-1934) was a Spanish pathologist, histologist, neuroscientist, and
Nobel laureate. His many pioneering investigations of the microscopic structure of the brain have 
established him as the father of modern neuroscience. 

A milestone from the learning theory’s point of view occurred in 1943, when Warren McCulloch 
and Walter Pitts [158] developed a computational model for the basic neuron. Moreover, they provided 
results that tie neurophysiology with mathematical logic. They showed that given a sufficient number 
of neurons and adjusting appropriately the synaptic links, each one associated with a weight, one can 
compute, in principle, any computable function. As a matter of fact, it is generally accepted that this is 
the paper that gave birth to the fields of neural networks and artificial intelligence. 

Warren McCulloch (1898-1969) was an American psychiatrist and neuroanatomist who spent many
years studying the representation of an event in the neural system. Walter Pitts (1923-1969) was an
American logician who worked in the field of cognitive psychology. He was a mathematical prodigy
who taught himself logic and mathematics. At the age of 12, he read Principia Mathematica by
Alfred North Whitehead and Bertrand Russell and he wrote a letter to Russell commenting on certain
parts of the book. He worked with a number of great mathematicians and logicians, including Wiener,
Householder, and Carnap. When he met McCulloch at the University of Chicago, he was familiar with
the work of Leibniz on computing, which inspired them to study whether the nervous system could be
considered to be a type of universal computing device. This gave birth to their 1943 paper, cited
above.

Frank Rosenblatt [195,196] borrowed the idea of a neuron model, as suggested by McCulloch and 
Pitts, to build a true learning machine which learns from a set of training data. In the most basic version 
of operation, he used a single neuron and adopted a rule that can learn to separate data, which belong to 
two linearly separable classes. That is, he built a pattern recognition system. He called the basic neuron 
a perceptron and developed a rule/algorithm, the perceptron algorithm, for the respective training. The
perceptron will be the kick-off point for our tour in this chapter. 

Frank Rosenblatt (1928-1971) was educated at Cornell, where he obtained his PhD in 1956. In
1959, he took over as director of Cornell's Cognitive Systems Research Program and also as a lecturer
in the psychology department. He used an IBM 704 computer to simulate his perceptron and later built
special-purpose hardware that implemented the perceptron learning rule.

Neural networks are learning machines, comprising a large number of neurons, which are connected
in a layered fashion. Learning is achieved by adjusting the synaptic weights to minimize a
preselected cost function. It took almost 25 years, after the pioneering work of Rosenblatt, for neural
networks to find their widespread use in machine learning. This is the time period needed for the basic
McCulloch-Pitts model of a neuron to be generalized and lead to an algorithm for training such
networks. A breakthrough came under the name backpropagation algorithm, which was developed
for training neural networks based on a set of input-output training samples. Backpropagation is also
treated in detail in this chapter.

It is interesting to note that neural networks dominated the field of machine learning for almost a
decade, from 1986 until the middle of the 1990s. Then, they were superseded, to a large extent, by
the support vector machines, which established their reign until 2010 or so. After that, neural networks
with many layers, known as deep networks, have taken over the “kingdom” of machine learning. Earlier
works, associated with what is known as convolutional networks [129] and recurrent neural networks
[94], have inspired the field that is now flourishing. This comeback, however, would not have been
possible without the availability of computational power, due to advances in computer architectures, as
well as the buildup of big data sets that are needed for training such networks.

Interestingly enough, there is one name that is associated with the revival of interest in neural
networks, both in the mid-1980s and in the first decade of the 21st century; this is the name of Geoffrey
Hinton [86,199]. Geoffrey Hinton, together with Yoshua Bengio and Yann LeCun, received the 2018
ACM A.M. Turing Award (announced in 2019) for their contributions to the field of neural networks.


18.2 THE PERCEPTRON 

Our starting point is the simple problem of a linearly separable two-class ($\omega_1, \omega_2$) classification task.
In other words, we are given a set of training samples, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, with $y_n \in \{-1, +1\}$,
$\boldsymbol{x}_n \in \mathbb{R}^l$, and it is assumed that there is a hyperplane,

$$\boldsymbol{\theta}_*^T \boldsymbol{x} = 0,$$

such that

$$\boldsymbol{\theta}_*^T \boldsymbol{x} > 0, \quad \text{if } \boldsymbol{x} \in \omega_1,$$
$$\boldsymbol{\theta}_*^T \boldsymbol{x} < 0, \quad \text{if } \boldsymbol{x} \in \omega_2.$$

In other words, such a hyperplane classifies correctly all the points in the training set. For notational
simplification, the bias term of the hyperplane has been absorbed in $\boldsymbol{\theta}_*$ after extending the dimensionality
of the input space by one, as has been explained in Chapter 3 and used in various parts of this
book.

The goal now becomes that of developing an algorithm that iteratively computes a hyperplane that
classifies correctly all the patterns from both classes. To this end, a cost function is adopted.


The perceptron cost: Let the available vector estimate at the current iteration step of the unknown
parameters be $\boldsymbol{\theta}$. Then there are two possibilities. The first one is that all points are classified correctly;
this means that a solution has been obtained. The other alternative is that $\boldsymbol{\theta}$ classifies correctly some of
the points and the rest are misclassified. Let $\mathcal{Y}$ be the set of all misclassified samples. The perceptron
cost is defined as

$$J(\boldsymbol{\theta}) = -\sum_{n: \boldsymbol{x}_n \in \mathcal{Y}} y_n \boldsymbol{\theta}^T \boldsymbol{x}_n : \text{perceptron cost}, \tag{18.1}$$

where

$$y_n = \begin{cases} +1, & \text{if } \boldsymbol{x}_n \in \omega_1, \\ -1, & \text{if } \boldsymbol{x}_n \in \omega_2. \end{cases} \tag{18.2}$$

Observe that the cost function is nonnegative. Indeed, because the sum is over the misclassified points,
if $\boldsymbol{x}_n \in \omega_1$ ($\omega_2$), then $\boldsymbol{\theta}^T \boldsymbol{x}_n \le (\ge)\ 0$, rendering the product $-y_n \boldsymbol{\theta}^T \boldsymbol{x}_n \ge 0$. A solution is achieved if
there are no misclassified points, that is, $\mathcal{Y} = \emptyset$. By convention, we can say that in this case $J(\boldsymbol{\theta}) = 0$.
The perceptron cost function is not differentiable at all points. It is a continuous piecewise linear
function. Indeed, let us write it in a slightly different way,

$$J(\boldsymbol{\theta}) = -\Big(\sum_{n: \boldsymbol{x}_n \in \mathcal{Y}} y_n \boldsymbol{x}_n^T\Big) \boldsymbol{\theta}.$$

This is a linear function with respect to $\boldsymbol{\theta}$, as long as the number of misclassified points remains the
same. However, as one slowly changes the value of $\boldsymbol{\theta}$, which corresponds to a change of the position
of the respective hyperplane, there will be a point where the number of misclassified samples in $\mathcal{Y}$
suddenly changes; this is the time where a sample in the training set changes its relative position with
respect to the (moving) hyperplane and as a consequence the set $\mathcal{Y}$ is modified. After this change, $J(\boldsymbol{\theta})$
will correspond to a new linear function.
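
The perceptron cost of Eq. (18.1) can be evaluated directly from its definition. The following is a minimal sketch; the function name `perceptron_cost`, the toy data, and the choice to treat points lying exactly on the hyperplane as misclassified are assumptions made for the illustration.

```python
def perceptron_cost(theta, samples):
    """Return (cost, misclassified) for the perceptron cost of Eq. (18.1).

    samples: list of (y, x) pairs with y in {-1, +1}; each x already
    includes the constant 1 that absorbs the bias term.
    A sample counts as misclassified when y * theta^T x <= 0
    (an assumption about how points on the hyperplane are treated).
    """
    misclassified = []
    cost = 0.0
    for y, x in samples:
        score = y * sum(t * xi for t, xi in zip(theta, x))
        if score <= 0:
            misclassified.append((y, x))
            cost -= score  # each misclassified sample adds -y * theta^T x >= 0
    return cost, misclassified

# Hypothetical toy data; the last component of each x is the constant 1.
data = [(+1, [2.0, 1.0, 1.0]), (+1, [1.0, 3.0, 1.0]),
        (-1, [-1.0, -2.0, 1.0]), (-1, [-2.0, 0.5, 1.0])]

cost_bad, missed_bad = perceptron_cost([-1.0, 0.0, 0.0], data)   # wrong orientation
cost_good, missed_good = perceptron_cost([1.0, 0.0, 0.0], data)  # separates the data
```

Because the set of misclassified samples is recomputed for every value of the parameter vector, the returned cost is linear only between the points where that set changes, which is the piecewise linear behavior described above.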

The perceptron algorithm: It can be shown (for example, [167,196]) that, starting from an arbitrary
point, $\boldsymbol{\theta}^{(0)}$, the following iterative update,

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \mu_i \sum_{n: \boldsymbol{x}_n \in \mathcal{Y}} y_n \boldsymbol{x}_n : \text{the perceptron rule}, \tag{18.3}$$

converges after a finite number of steps. The parameter sequence $\mu_i$ is judiciously chosen to guarantee
convergence. Note that this is the same algorithm as the one derived in Section 8.10.2, for minimizing
the hinge loss function via the notion of subgradient.
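
A minimal sketch of the update in Eq. (18.3) follows. The function name `batch_perceptron`, the toy data, the constant step size, and the iteration cap are assumptions made for the illustration; the text only requires that the step-size sequence be chosen appropriately.

```python
def batch_perceptron(samples, theta0, mu=1.0, max_iter=1000):
    """Iterate Eq. (18.3) with a constant step size mu (an assumption).

    At every step, theta is corrected by mu times the sum of y_n * x_n
    over the currently misclassified samples. Returns (theta, iterations).
    """
    theta = list(theta0)
    for iteration in range(max_iter):
        missed = [(y, x) for y, x in samples
                  if y * sum(t * xi for t, xi in zip(theta, x)) <= 0]
        if not missed:
            return theta, iteration
        for j in range(len(theta)):
            theta[j] += mu * sum(y * x[j] for y, x in missed)
    raise RuntimeError("no separating hyperplane found within max_iter iterations")

# Hypothetical linearly separable data; the last component is the constant 1.
data = [(+1, [2.0, 1.0, 1.0]), (+1, [1.0, 3.0, 1.0]),
        (-1, [-1.0, -2.0, 1.0]), (-1, [-2.0, 0.5, 1.0])]

theta, n_iter = batch_perceptron(data, theta0=[0.0, 0.0, 0.0])
```

The iteration cap is a safeguard added here so that the loop also ends on data that are not linearly separable.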

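Besides the previous scheme, another version of the algorithm considers one sample per iteration
in a cyclic fashion, until the algorithm converges. Let us denote by $(y_{(i)}, \boldsymbol{x}_{(i)})$, $(i) \in \{1, 2, \ldots, N\}$, the
training pair that is presented in the algorithm at the $i$th iteration step (see footnote 1). Then the update
iteration becomes

$$\boldsymbol{\theta}^{(i)} = \begin{cases} \boldsymbol{\theta}^{(i-1)} + \mu_i y_{(i)} \boldsymbol{x}_{(i)}, & \text{if } \boldsymbol{x}_{(i)} \text{ is misclassified by } \boldsymbol{\theta}^{(i-1)}, \\ \boldsymbol{\theta}^{(i-1)}, & \text{otherwise}. \end{cases} \tag{18.4}$$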


In other words, starting from an initial estimate, e.g., randomly initializing $\boldsymbol{\theta}^{(0)}$ with some small values,
we test each one of the samples, $\boldsymbol{x}_n$, $n = 1, 2, \ldots, N$. Every time a sample is misclassified, action is
taken for a correction. Otherwise no action is required. Once all samples have been considered, we say
that one epoch has been completed. If no convergence has been attained, all samples are reconsidered
in a second epoch, and so on. This algorithmic version is known as a pattern-by-pattern scheme.
Sometimes it is also referred to as an online algorithm. However, note that the term “online” has been
used in previous chapters in a different context, when data were received in a streaming/sequential
fashion and their number was unbounded. In contrast, in the current context, the total number of data
samples is fixed and the algorithm considers them in a cyclic fashion, epoch after epoch. To avoid
confusion, the former term, pattern-by-pattern, will be adopted.

1 The symbol (i) has been adopted to denote the index of the samples, instead of i, because we do not know which point will
be presented to the algorithm at the ith iteration. Recall that each training point is considered many times, until convergence is
achieved.

FIGURE 18.1

Pattern $\boldsymbol{x}$ is misclassified by the red line. The action of the perceptron rule is to turn the hyperplane toward the
point $\boldsymbol{x}$, in an attempt to include it in the correct side of the new hyperplane and classify it correctly. The new
hyperplane is defined by $\boldsymbol{\theta}^{(i)}$ and it is shown by the black line.

After a finite number of successive epochs, the algorithm is guaranteed to converge. Note that
for convergence, the sequence $\mu_i$ must be appropriately chosen. This is pretty familiar to us by now.
However, for the case of the perceptron algorithm, convergence is still guaranteed even if $\mu_i$ is a
positive constant, $\mu_i = \mu > 0$, usually taken to be equal to one (Problem 18.1).

The formulation in (18.4) brings the perceptron algorithm under the umbrella of the so-called 
reward-punishment philosophy of learning. If the current estimate succeeds in predicting the class 
of the respective pattern, no action is taken (reward). Otherwise, the algorithm is punished by having to
perform an update.

Fig. 18.1 provides a geometric interpretation of the perceptron rule. Assume that sample $\boldsymbol{x}$ is misclassified
by the hyperplane, $\boldsymbol{\theta}^{(i-1)}$. As we know from geometry, $\boldsymbol{\theta}^{(i-1)}$ corresponds to a vector that is
perpendicular to the hyperplane which is defined by this vector (see also Fig. 11.15 in Section 11.10.1).
Because $\boldsymbol{x}$ lies in the (−) side of the hyperplane and it is misclassified, it belongs to class $\omega_1$. Hence,
assuming $\mu_i = 1$, the applied correction by the algorithm is

$$\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \boldsymbol{x},$$

and its effect is to turn the hyperplane to the direction toward $\boldsymbol{x}$ to place it in the (+) side of the new
hyperplane, which is defined by the updated estimate $\boldsymbol{\theta}^{(i)}$.

The perceptron algorithm in its pattern-by-pattern mode of operation is summarized in Algorithm 18.1.

Algorithm 18.1 (The pattern-by-pattern perceptron algorithm).

• Initialization
  - Initialize $\boldsymbol{\theta}^{(0)}$; usually, randomly, to some small values.
  - Select $\mu$; usually it is set equal to one.
  - $i = 1$.
• Repeat; Each iteration corresponds to one epoch.
  - counter = 0; Counts the number of updates per epoch.
  - For $n = 1, 2, \ldots, N$, Do; For each epoch, all samples are presented once.
    * If ($y_n \boldsymbol{x}_n^T \boldsymbol{\theta}^{(i-1)} \le 0$) Then
        $\boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \mu y_n \boldsymbol{x}_n$
        $i = i + 1$
        counter = counter + 1
  - End For
• Until counter = 0
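
The steps of Algorithm 18.1 can be sketched in Python as follows. The function name `perceptron_train`, the toy data, and the `max_epochs` safeguard are assumptions introduced for the illustration; the algorithm itself has no cap on the number of epochs.

```python
def perceptron_train(samples, theta0, mu=1.0, max_epochs=1000):
    """Pattern-by-pattern perceptron following the steps of Algorithm 18.1.

    samples: list of (y, x) pairs, y in {-1, +1}, x extended by a constant 1.
    Returns (theta, epochs). max_epochs is a safeguard added here so that
    the loop also ends on data that are not linearly separable.
    """
    theta = list(theta0)
    for epoch in range(1, max_epochs + 1):
        counter = 0  # number of updates performed in this epoch
        for y, x in samples:
            if y * sum(t * xi for t, xi in zip(theta, x)) <= 0:
                # Misclassified sample: correct theta toward y * x.
                theta = [t + mu * y * xi for t, xi in zip(theta, x)]
                counter += 1
        if counter == 0:  # a full epoch without updates: converged
            return theta, epoch
    raise RuntimeError("not converged within max_epochs epochs")

# Hypothetical linearly separable data; the last component is the constant 1.
data = [(+1, [2.0, 1.0, 1.0]), (+1, [1.0, 3.0, 1.0]),
        (-1, [-1.0, -2.0, 1.0]), (-1, [-2.0, 0.5, 1.0])]

theta, epochs = perceptron_train(data, theta0=[0.0, 0.0, 0.0])
```

Starting from the zero vector, the first sample triggers the only update, and the second epoch passes without corrections, which ends the loop.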

Once the perceptron algorithm has run and converged, we have the weights, $\theta_i$, $i = 1, 2, \ldots, l$, of
the synapses of the associated neuron/perceptron as well as the bias term $\theta_0$. These can now be used
to classify unknown patterns. Fig. 18.2A shows the corresponding architecture of the basic neuron element.
The features $x_i$, $i = 1, 2, \ldots, l$, are applied to the input nodes. In turn, each feature is multiplied
by the respective synapse (weight), and then the bias term is added on their linear combination. The
outcome of this operation then goes through a nonlinear function, $f$, known as the activation function.
Depending on the form of the nonlinearity, different types of neurons occur. In the more classical one,
known as the McCulloch-Pitts neuron, the activation function is the Heaviside one, that is,

$$f(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z < 0. \end{cases} \tag{18.5}$$


Usually, the summation operation and the nonlinearity are merged to form a node in the respective 
graph and the architecture in Fig. 18.2B occurs. In the sequel, both terms, neuron and node, will be 
used interchangeably, the latter indicating a neuron within a larger network. 

FIGURE 18.2

(A) In the basic neuron/perceptron architecture the input features are applied to the input nodes and are weighted
by the respective weights that define the synapses. The bias term is then added on their linear combination and the
result is pushed through the nonlinearity. In the McCulloch-Pitts neuron, the output fires a 1 for patterns in class $\omega_1$
or a zero for class $\omega_2$. (B) The summation and nonlinear operation are merged together for graphical simplicity.

Thus, the basic model of an (artificial) neuron comprises the concatenation of (a) a linear combiner,
(b) a threshold value (bias), and (c) a nonlinearity. Note that because of the existence of the nonlinearity,
the output of the neuron indicates the class from which the input pattern originates. For the Heaviside
nonlinearity, the output is 1 for patterns from class $\omega_1$ and 0 for those from class $\omega_2$.
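
The basic neuron model, a linear combiner followed by a bias and a nonlinearity, can be sketched as below. The names `heaviside` and `neuron`, the example weights, and the value returned at exactly z = 0 are assumptions made for the illustration.

```python
def heaviside(z):
    """Heaviside activation; the value at z == 0 is set to 0 here by assumption."""
    return 1 if z > 0 else 0

def neuron(x, weights, bias, activation=heaviside):
    """Linear combination of the inputs plus bias, pushed through the activation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return activation(z)

# Hypothetical neuron realizing the line x1 + x2 - 1 = 0 in the plane.
out_pos = neuron([1.0, 1.0], weights=[1.0, 1.0], bias=-1.0)  # point on the (+) side
out_neg = neuron([0.0, 0.0], weights=[1.0, 1.0], bias=-1.0)  # point on the (-) side
```

Passing the activation as an argument makes it easy to swap the Heaviside function for a different nonlinearity without touching the linear combiner.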

Remarks 18.1. 

• ADALINE: Soon after Rosenblatt proposed the perceptron, Widrow and Hoff proposed the adaptive
linear element (ADALINE), which is a linear version of the perceptron [252]. That is, during training
the nonlinearity of the activation function is not involved. The resulting scheme is the LMS algorithm,
treated in detail in Chapter 5. It is interesting to note that the LMS was readily adopted and
widely used for online learning within the signal processing and communications communities.

• A kernelized version of the perceptron algorithm has also been derived and the interested reader can
obtain it via the book’s site under the Additional Material part that is associated with the current
chapter.


18.3 FEED-FORWARD MULTILAYER NEURAL NETWORKS 

A single neuron is associated with a hyperplane

$$H: \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_l x_l + \theta_0 = 0,$$

in the input (feature) space. Moreover, classification is performed via the nonlinearity, which fires a
one or stays at zero, depending on which side of $H$ a point lies. We will now show how to combine
neurons, in a layer-wise fashion, to construct nonlinear classifiers. We will follow a simple constructive
proof, which will unveil certain aspects of neural networks. These will be useful later on, when dealing
with deep architectures.

As a starting point, we consider the case where the classes in the feature space are formed by unions
of polyhedral regions. This is shown in Fig. 18.3, for the case of the two-dimensional feature space.
Polyhedral regions are formed as intersections of halfspaces, each one associated with a hyperplane.
In Fig. 18.3, there are three hyperplanes (straight lines in $\mathbb{R}^2$), indicated as $H_1$, $H_2$, $H_3$, giving rise to
seven polyhedral regions. For each hyperplane, the (+) and (−) sides (halfspaces) are indicated. In the
sequel, each one of the regions is labeled using a triplet of binary numbers, depending on which side
it is located with respect to $H_1$, $H_2$, $H_3$. For example, the region labeled as (101) lies in the (+) side of
$H_1$, the (−) side of $H_2$, and the (+) side of $H_3$.

Fig. 18.4A shows three neurons, realizing the three hyperplanes, $H_1$, $H_2$, and $H_3$, of Fig. 18.3,
respectively. The associated outputs, denoted as $y_1$, $y_2$, and $y_3$, form the label of the region in which
the corresponding input pattern lies. Indeed, if the weights of the synapses have been appropriately set,
then if a pattern originates from the region, say, (010), then the first neuron on the left will fire a zero
($y_1 = 0$), the middle a one ($y_2 = 1$), and the right-most a zero ($y_3 = 0$). In other words, combining the
outputs of the three neurons together, we have achieved a mapping of the input feature space into the
three-dimensional space. More specifically, the mapping is performed on the vertices of the unit cube
in $\mathbb{R}^3$, as shown in Fig. 18.5. Each region of the input space uniquely corresponds to one vertex of the
cube. In the more general case, where $p$ neurons are employed, the mapping will be on the vertices of
the unit hypercube in $\mathbb{R}^p$. This layer of neurons comprises the first hidden layer of the network, which
we are developing.
FIGURE 18.3

Classes are formed by unions of polyhedral regions. Regions are labeled according to the side on which they lie,
with respect to the three lines, $H_1$, $H_2$, and $H_3$. The number 1 indicates the (+) side and the 0 the (−) side. Class $\omega_1$
consists of the union of the (000) and (111) regions.






FIGURE 18.4 

(A) The neurons of the first hidden layer are excited by the feature values applied at the input nodes and form the 
polyhedral regions. (B) The neurons of the second layer have as inputs the outputs of the first layer, and they thereby 
form the classes. To simplify the figure, the bias terms for each neuron are not shown. 


An alternative way to view this mapping is as a new representation of the input patterns in terms of
code words. For three neurons/hyperplanes we can form $2^3$ binary code words, each corresponding to
a vertex of the unit cube, which can represent $2^3 - 1 = 7$ regions (there is one remaining vertex, i.e.,
(110), which does not correspond to any region). Note, however, that this mapping encodes information
concerning some structure of the input data; that is, information relating to how the input patterns are
grouped together in the feature space in different regions.

We will now use this new representation, as it is provided by the outputs of the neurons of the
first hidden layer, as input which feeds the neurons of a second hidden layer, which is constructed as
follows. We choose all regions that belong to one class. For the sake of our example in Fig. 18.3, we
select the two regions that correspond to class $\omega_1$, that is, (000) and (111). Recall that all the points
from these regions are mapped to the respective vertices of the unit cube in $\mathbb{R}^3$. However, in this
new transformed space, each one of the vertices is linearly separable from the rest. This means that we
can use a neuron/perceptron in the transformed space, which will place a single vertex in the (+) side
and the rest in the (−) one of the associated hyperplane. This is shown in Fig. 18.5, where two such
planes are shown, which separate the respective vertices from the rest. Each of these planes is realized
by a neuron, operating in $\mathbb{R}^3$, as shown in Fig. 18.4B, where a second layer of hidden neurons has been
added.

FIGURE 18.5

The neurons of the first hidden layer perform a mapping from the input feature space to the vertices of a unit hypercube.
Each region is mapped to a vertex. Each vertex of the hypercube is now linearly separable from all the rest
and it can be separated by a hyperplane realized by a neuron. The vertex 110, denoted as an unshaded circle, does
not correspond to any region.

Note that the output $z_1$ of the left neuron will fire a 1 only if the input pattern originates from the
region (000) and it will be at 0 for all other patterns. For the neuron on the right, the output $z_2$ will be 1 for
all the patterns coming from region (111) and zero for all the rest. Note that this second layer of neurons
has performed a second mapping, this time to the vertices of the unit square in $\mathbb{R}^2$. This mapping
provides a new representation of the input patterns, and this representation encodes information related
to the classes of the regions. Fig. 18.6 shows the mapping to the vertices of the unit square in the
$(z_1, z_2)$ space.

FIGURE 18.6

The patterns from class $\omega_1$ are mapped either to (01) or to (10) and patterns from class $\omega_2$ are all mapped to (00).
Thus the classes have now become linearly separable and can be separated via a straight line realized by a neuron.

Note that all the points originating from class $\omega_2$ are mapped to (00) and the points from class $\omega_1$ are
mapped either to (10) or to (01). This is very interesting; by successive mappings, we have transformed
our originally nonlinearly separable task to one that is linearly separable. Indeed, the point (00) can
be linearly separated from (01) and (10) and this can be realized by an extra neuron operating in the
$(z_1, z_2)$ space; it is known as the output neuron, because it provides the final classification decision. The
final resulting network is shown in Fig. 18.7. We call this network feed-forward, because information
flows forward from the input to the output layer. It comprises the input layer, which is a nonprocessing
one, two hidden layers (the term “hidden” is self-explanatory), and one output layer. We call such a neural
network a three-layer network, without counting the input layer of nonprocessing nodes.

FIGURE 18.7

A three-layer feed-forward neural network. It comprises the input (nonprocessing) layer, two hidden layers, and
one output layer of neurons. Such a three-layer neural network can solve any classification task, where classes are
formed by unions of polyhedral regions.
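
The second hidden layer and the output neuron of this construction can be checked on all eight vertices of the unit cube. The sketch below assumes that the first hidden layer has already produced the binary label of the region; the particular weights and thresholds are one possible choice, picked for illustration, and the function names are hypothetical.

```python
from itertools import product

def step(z):
    """Heaviside nonlinearity (value 0 at z == 0 by assumption)."""
    return 1 if z > 0 else 0

def second_hidden_layer(y):
    """Two neurons, each isolating one vertex of the unit cube.

    z1 fires only for the vertex (0,0,0); z2 fires only for (1,1,1).
    """
    y1, y2, y3 = y
    z1 = step(-y1 - y2 - y3 + 0.5)
    z2 = step(y1 + y2 + y3 - 2.5)
    return z1, z2

def output_neuron(z):
    """Separates (0,0) from (1,0) and (0,1) with the line z1 + z2 - 0.5 = 0."""
    z1, z2 = z
    return step(z1 + z2 - 0.5)

# Decision for every vertex of the cube: 1 for the first class, 0 for the second.
decisions = {y: output_neuron(second_hidden_layer(y))
             for y in product((0, 1), repeat=3)}
```

Only the vertices (000) and (111) produce a 1 at the output, in agreement with the class assignment of the example.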

We have constructively shown that a three-layer feed-forward neural network can, in principle,
solve any classification task whose classes are formed by unions of polyhedral regions. Although we
focused on the two-class case, the generalization to multiclass cases is straightforward, by employing
more output neurons depending on the number of classes. Note that in some cases, one hidden layer
of nodes may be sufficient. This depends on whether the vertices on which the regions are mapped
are assigned to classes so that linear separability is possible. For example, this would be the case if
class $\omega_1$ was the union of the (000) and (100) regions. Then these two vertices could be separated from
the rest via a single plane and a second hidden layer of neurons would not be required (check why).
In any case, we will not take our discussion any further. The reason is that such a construction is
important to demonstrate the power of building a multilayer neural network, in analogy to what is
happening in our brain. However, from a practical point of view, such a construction has not much to
offer. In practice, when the data live in high-dimensional spaces, there is no chance of determining the
parameters that define the neurons analytically to realize the hyperplanes, which form the polyhedral
regions. Furthermore, in real life, classes are not necessarily formed by the union of polyhedral regions
and, more importantly, classes do overlap. Hence, one needs to devise a training procedure based on a cost
function and a set of training data.

All that we have to keep from our previous discussion is the structure of the multilayer network;
our focus will turn on seeking ways for estimating the unknown weights of the synapses and biases of
the neurons. However, from a conceptual point of view, we have to remember that each layer performs
a mapping into a new space, and each mapping provides a different, hopefully more informative,
representation of the input data, until the last layer, where the task has been transformed into one that is
easy to solve.

18.3.1 FULLY CONNECTED NETWORKS

The feed-forward networks that have been introduced before are also known as fully connected
networks. This name stresses that each one of the neurons/nodes in any layer is directly connected
to every node of the previous layer. The nodes of the first hidden layer are fully connected to those of
the input layer. In other words, each neuron is associated with a vector of parameters, whose dimension
is equal to the number of nodes of the previous (input) layer. The algebraic operations which are
performed are inner products.

To summarize in a more formal way the type of operations that take place in a fully connected
network, let us focus on, say, the $r$th layer of a multilayer neural network and assume that it comprises
$k_r$ neurons. The input vector to this layer consists of the outputs at the nodes of the previous layer,
denoted as $\boldsymbol{y}^{r-1}$. Let $\boldsymbol{\theta}_j^r$ be the vector of the synaptic weights, including the bias term, associated with
the $j$th neuron of the $r$th layer, where $j = 1, 2, \ldots, k_r$. The respective dimension is $k_{r-1} + 1$, where
$k_{r-1}$ is the number of the neurons of the previous, $r - 1$, layer and the increase by 1 accounts for the
bias term. Then the performed operations, prior to the nonlinearity, are the inner products

$$z_j^r = \boldsymbol{\theta}_j^{rT} \boldsymbol{y}^{r-1}, \quad j = 1, 2, \ldots, k_r.$$

Collecting all the output values into a vector, $\boldsymbol{z}^r = [z_1^r, z_2^r, \ldots, z_{k_r}^r]^T$, and stacking all the synaptic
vectors as rows, one under the other, in a matrix $\Theta$, we can write collectively

$$\boldsymbol{z}^r = \Theta \boldsymbol{y}^{r-1}, \quad \text{where } \Theta := [\boldsymbol{\theta}_1^r, \boldsymbol{\theta}_2^r, \ldots, \boldsymbol{\theta}_{k_r}^r]^T.$$

The vector of the outputs of the $r$th hidden layer, after pushing each through the nonlinearity $f$, is
finally given by

$$\boldsymbol{y}^r = \begin{bmatrix} 1 \\ f(\boldsymbol{z}^r) \end{bmatrix},$$

where the notation above means that $f$ acts on each one of the respective vector components, individually,
and the extension of the vector by one is to account for the bias terms in the standard practice.
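
The operations of one fully connected layer can be sketched as follows. The function name `layer_forward`, the toy numbers, and the convention of storing the bias as the first entry of each row (to match the leading 1 of the extended vector) are assumptions made for the illustration.

```python
def step(z):
    """Heaviside nonlinearity, used here only as an example activation."""
    return 1.0 if z > 0 else 0.0

def layer_forward(Theta, y_prev, f=step):
    """Forward pass through one fully connected layer.

    Theta: k_r rows, each of length k_{r-1} + 1, with the bias stored first
    (an assumed convention matching the leading 1 of the extended vector).
    y_prev: extended output of the previous layer, [1, y_1, ..., y_{k_{r-1}}].
    Returns the extended output [1, f(z_1), ..., f(z_{k_r})].
    """
    z = [sum(t * v for t, v in zip(row, y_prev)) for row in Theta]  # inner products
    return [1.0] + [f(zj) for zj in z]

# A layer with k_r = 2 neurons fed by k_{r-1} = 2 nodes.
Theta = [[0.0, 1.0, -1.0],   # neuron 1: bias 0.0
         [0.5, 0.0, -2.0]]   # neuron 2: bias 0.5
y_out = layer_forward(Theta, [1.0, 3.0, 1.0])

# Number of parameters of this layer: k_r * (k_{r-1} + 1).
n_params = len(Theta) * len(Theta[0])
```

The parameter count of the layer grows with the product of the two layer sizes, which is what makes full connectivity expensive for wide layers.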

For large networks, with many layers and many nodes per layer, this type of connectivity turns out
to be very costly in terms of the number of parameters (weights), which is of the order of $k_r k_{r-1}$. For
example, if $k_{r-1} = 1000$ and $k_r = 1000$, this amounts to an order of 1 million parameters. Note that
this number is the contribution from the parameters of only one of the layers. However, a large number
of parameters makes a network vulnerable to overfitting, when training is involved, as has already been
discussed in Section 3.8.

In contrast, one can employ the so-called weight sharing techniques (e.g., [181,231]), where a set of
parameters are shared among a number of connections, via appropriately built-in constraints. The
convolutional networks, to be discussed in Section 18.12, belong to this family of weight sharing networks.
As we will see, in a convolutional network, convolutions replace the inner product operations, which
allows for a significant weight sharing that leads to a substantial reduction in the required number of
parameters.


Remarks 18.2. 

• Shallow and deep networks: From now on, we will refer to the number of layers in a network as the
depth of the network. Networks with up to three (two hidden) layers are known as shallow, whereas
those with more than three layers are called deep networks.


18.4 THE BACKPROPAGATION ALGORITHM 

A feed-forward neural network consists of a number of layers of neurons, and each neuron is determined
by the corresponding set of synaptic weights and its bias term. From this point of view, a neural
network realizes a nonlinear parametric function, $\hat{y} = f_{\boldsymbol{\theta}}(\boldsymbol{x})$, where $\boldsymbol{\theta}$ stands for all the weights/biases
present in the network. Thus, training a neural network seems not to be any different from training any
other parametric prediction model. All that is needed is (a) a set of training samples, (b) a loss function,
$\mathcal{L}(y, \hat{y})$, and (c) an iterative scheme, for example, gradient descent, to perform the optimization of the
associated cost function (empirical loss; see Section 3.14, Chapter 3),

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \mathcal{L}\big(y_n, f_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big). \tag{18.6}$$

The difficulty with training neural networks lies in their multilayer structure that complicates the
computation of the gradients, which are involved in the optimization. Moreover, the McCulloch-Pitts
neuron is based on the discontinuous Heaviside activation function, which is not differentiable. A first
step in developing a practical algorithm for training a neural network is to replace the Heaviside
activation function with a differentiable approximation of it.
FIGURE 18.8

The logistic sigmoid function for different values of the parameter a. 


The logistic sigmoid neuron: One possibility is to adopt the logistic sigmoid function, that is,

$$f(z) = \sigma(z) := \frac{1}{1 + \exp(-az)}. \tag{18.7}$$

The graph of the function is shown in Fig. 18.8. Note that the larger the value of the parameter $a$ is, the
closer the corresponding graph gets to that of the Heaviside function. Another possibility would be to
use

$$f(z) = a \tanh\left(\frac{cz}{2}\right), \tag{18.8}$$

where $c$ and $a$ are controlling parameters. The graph of this function is shown in Fig. 18.9. Note that
in contrast to the logistic sigmoid one, this is an antisymmetric function, that is, $f(-z) = -f(z)$. Both
are also known as squashing functions, because they limit the output to a finite range of values.
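
The two squashing functions can be compared numerically. This is a small sketch: the function names are hypothetical, and the form a·tanh(cz/2) is the reading of Eq. (18.8) assumed here, with default values a = 1.7 and c = 4/3 taken from the caption of Fig. 18.9.

```python
import math

def logistic(z, a=1.0):
    """Logistic sigmoid of Eq. (18.7); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a * z))

def scaled_tanh(z, a=1.7, c=4.0 / 3.0):
    """Hyperbolic tangent squashing function a * tanh(c * z / 2); output lies in [-a, a]."""
    return a * math.tanh(c * z / 2.0)

# A larger slope parameter pushes the sigmoid toward the Heaviside function.
steep = logistic(0.5, a=20.0)  # very close to 1
flat = logistic(0.5, a=1.0)    # noticeably below 1

# Antisymmetry of the hyperbolic tangent: f(-z) = -f(z).
antisymmetry_gap = scaled_tanh(-0.8) + scaled_tanh(0.8)
```

The logistic sigmoid takes the value 0.5 at the origin for any slope, whereas the hyperbolic tangent passes through zero there.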


NONCONVEXITY OF THE COST FUNCTION 

Optimization of cost functions has been a recurrent theme that runs across this book. In Chapter 5,
the gradient descent optimization algorithm was introduced and some of its convergence properties
were outlined. In the sequel, the stochastic approximation framework for optimizing expectations of
loss functions in an online, sample after sample, mode of operation was discussed. The gradient descent
scheme and the online optimization rationale comprise the spine for a number of optimization
algorithms that have been proposed and used for training feed-forward neural networks. The reason
that we need to discuss these algorithms specifically in the context of neural networks is that the
multilayer structure of these networks poses difficulties that have to be understood, prior to applying an
off-the-shelf optimization scheme.
FIGURE 18.9

The hyperbolic tangent squashing function for a = 1.7 and c = 4/3. 


A major difficulty when dealing with the minimization of a cost function, such as in Eq. (18.6), in
the framework of neural networks, is its nonconvexity. Convex functions were discussed in Chapter 8.
There, it was pointed out that for the case of a convex function, every local minimum is also a global
one. Recall, for example, a typical graph of a convex function in Fig. 8.3. By the definition of a minimum
of a function, we know that at this point the gradient (derivative in the simplest single-parameter
space) becomes zero. Points where the gradient of a function becomes zero are known as critical or
stationary points.

However, if the function is not convex, a stationary point can belong to one of the following three 
categories: to be (a) a local minimum (maximum), (b) a global minimum (maximum), or (c) a sad- 
dle point; see Fig. 18.10 for the case of the one-dimensional space. All these stationary points are of 
importance in neural networks. A local minimum is the point where the value of the cost, /(#/), be¬ 
comes minimum within a region around 0/. The global minimum, 6 g , is the point where J(0 g )<J(0 ), 
V0 e R. A saddle point, 9 S , is neither a minimum nor a maximum, yet the derivative (gradient in gen- 
eral) is equal to zero. 

When adopting a gradient descent scheme to minimize a nonconvex cost function, the algorithm
can converge to either a local or a global minimum. Take, for example, the case of Fig. 18.10. Recall
from Chapter 5 that the update rule of the gradient descent algorithm, in its one-dimensional version,
becomes

$$\theta(\text{new}) = \theta(\text{old}) - \mu \left.\frac{dJ}{d\theta}\right|_{\theta(\text{old})},$$

and the iterations start from an arbitrary initial point, $\theta^{(0)}$. If at the current iteration the algorithm is,
say, at the point $\theta(\text{old}) = \theta_1$, then it will move towards the local minimum, $\theta_l$. This is because the
derivative of the cost at $\theta_1$ is equal to the tangent of $\phi_1$, which is negative (the angle is obtuse) and the
update, $\theta(\text{new})$, will move to the right towards the local minimum, $\theta_l$. In contrast, if the algorithm had
been initialized from a different point and the algorithm is currently at, say, $\theta(\text{old}) = \theta_2$, the update will
move towards the global minimum, $\theta_g$, since the derivative is now equal to the tangent of $\phi_2$, which is
positive (the angle is acute). As we know from Chapter 5, the choice of the step size, $\mu$, is critical for
the convergence of the algorithm.

FIGURE 18.10

A nonconvex function, besides the global minimum, usually comprises a number of local minima and saddle
points. To which minimum, out of the many, the algorithm will converge depends on the point from which the
algorithm was initialized. However, if the value of the cost at a local minimum, e.g., $J(\theta_l)$, is not much
larger than that of the global minimum, i.e., $J(\theta_g)$, then $\theta_l$ can be a satisfactory solution in practice.

In real problems in multidimensional spaces, the number of local minima can be large, so the
algorithm can converge to a local one. However, this is not necessarily very bad news. If this local
minimum is deep enough, that is, if the value of the cost function at this point, e.g., $J(\theta_l)$, is not much
larger than that achieved at the global minimum, i.e., $J(\theta_g)$, convergence to such a local minimum can
correspond to a good solution. In practice, one has to be careful with how to initialize an algorithm
when dealing with nonconvex cost functions. We are going to discuss this issue later on. Also, the
interplay between local, global, and saddle points, when the dimension of the parameter space becomes
large, which is the case for deep networks, is discussed in Section 18.11.1.
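
The dependence on the initial point can be reproduced with a small numerical experiment. The following sketch uses a hypothetical one-dimensional quartic cost, chosen here only because it has one local and one global minimum; it is not the function plotted in Fig. 18.10, and the step size and iteration count are likewise illustrative.

```python
def cost(theta):
    # Hypothetical nonconvex cost: global minimum near -1.30, local minimum near 1.13.
    return theta**4 - 3.0 * theta**2 + theta

def grad(theta):
    # Derivative of the cost above.
    return 4.0 * theta**3 - 6.0 * theta + 1.0

def gradient_descent(theta0, mu=0.01, n_iter=2000):
    """One-dimensional gradient descent: theta(new) = theta(old) - mu * dJ/dtheta."""
    theta = theta0
    for _ in range(n_iter):
        theta = theta - mu * grad(theta)
    return theta

theta_from_right = gradient_descent(2.0)    # ends at the local minimum
theta_from_left = gradient_descent(-2.0)    # ends at the global minimum
print(theta_from_right, cost(theta_from_right))
print(theta_from_left, cost(theta_from_left))
```

The same algorithm with the same step size ends at different stationary points; only the starting point differs.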

18.4.1 THE GRADIENT DESCENT BACKPROPAGATION SCHEME 

Having adopted a differentiable activation function, we are ready to proceed with developing the
gradient descent iterative scheme for the minimization of the cost function. We will formulate the task in
a general framework.

Let $(\boldsymbol{y}_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$, be the set of training samples. Note that we have assumed multiple
output variables, assembled as a vector. We assume that the network comprises $L$ layers; $L - 1$ hidden
layers and one output layer. Each layer consists of $k_r$, $r = 1, 2, \ldots, L$, neurons. Thus, the
(target/desired) output vectors are

$$\boldsymbol{y}_n = [y_{n1}, y_{n2}, \ldots, y_{nk_L}]^T \in \mathbb{R}^{k_L}, \quad n = 1, 2, \ldots, N.$$

For the sake of the mathematical derivations, we also denote the number of input nodes as $k_0$; that is,
$k_0 = l$, where $l$ is the dimensionality of the input feature space.

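Let $\boldsymbol{\theta}_j^r$ denote the vector of the synaptic weights associated with the $j$th neuron in the $r$th layer,
with $j = 1, 2, \ldots, k_r$ and $r = 1, 2, \ldots, L$, where the bias term is included in $\boldsymbol{\theta}_j^r$, that is,

$$\boldsymbol{\theta}_j^r := [\theta_{j0}^r, \theta_{j1}^r, \ldots, \theta_{jk_{r-1}}^r]^T. \tag{18.9}$$

The synaptic weights link the respective neuron to all $k_{r-1}$ neurons of layer $r - 1$ (see Fig. 18.11). The basic
iterative step for the gradient descent scheme is written as

$$\boldsymbol{\theta}_j^r(\text{new}) = \boldsymbol{\theta}_j^r(\text{old}) + \Delta\boldsymbol{\theta}_j^r, \tag{18.10}$$

$$\Delta\boldsymbol{\theta}_j^r = -\mu \left.\frac{\partial J}{\partial \boldsymbol{\theta}_j^r}\right|_{\boldsymbol{\theta}_j^r(\text{old})}. \tag{18.11}$$

The parameter $\mu$ is the user-defined step size (it can also be iteration dependent) and $J$ denotes the cost
function.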



FIGURE 18.11

The links and the associated variables of the $j$th neuron at the $r$th layer; $y_k^{r-1}$ is the output of the $k$th neuron at the
$(r - 1)$th layer and $\theta_{jk}^r$ is the respective weight connecting these two neurons. The dependence on index $n$ has been
suppressed for notational convenience.

Update Eqs. (18.10) and (18.11) comprise the pair of the gradient descent scheme for optimization 
(e.g., Chapter 5). As previously stated, the difficulty in feed-forward neural networks arises from their 
multilayer structure. In order to compute the gradients in Eq. (18.11), for all neurons in all layers, one 
has to follow two steps of computations. 

• Forward computations: For a given input vector $\boldsymbol{x}_n$, $n = 1, 2, \ldots, N$, use the current estimates of
the parameters (synaptic weights), $\boldsymbol{\theta}_j^r(\text{old})$, and compute all the outputs of all the neurons in all
layers, denoted as $y_{nj}^r$; in Fig. 18.11, the index $n$ has been suppressed to unclutter notation.

• Backward computations: Using the above computed neuronal outputs together with the known target
values, $y_{nk}$, of the output layer, compute the gradients of the cost function. This involves $L$ steps,
that is, as many as the number of layers. The sequence of the algorithmic steps is given below:

- Compute the gradient of the cost function with respect to the parameters of the neurons of the
last layer, i.e., $\frac{\partial J}{\partial \boldsymbol{\theta}_j^L}$, $j = 1, 2, \ldots, k_L$.

- For $r = L - 1$ to 1, Do

* Compute the gradients with respect to the parameters associated with the neurons of the $r$th
layer, i.e., $\frac{\partial J}{\partial \boldsymbol{\theta}_k^r}$, $k = 1, 2, \ldots, k_r$, based on all the gradients $\frac{\partial J}{\partial \boldsymbol{\theta}_j^{r+1}}$, $j = 1, 2, \ldots, k_{r+1}$, with
respect to the parameters of the layer $r + 1$ that have been computed in the previous step.

- End For

The backward computations scheme is a direct application of the chain rule for derivatives, and it
starts with the initial step of computing the derivatives associated with the last (output) layer, which
turns out to be straightforward. Then the algorithm “flows” backwards in the hierarchy of layers. This
is because of the nature of the multilayer network, where the outputs, layer after layer, are formed as
functions of functions. Indeed, let us focus on the output, $y_k^r$, of the $k$th neuron at layer $r$. Then we have

$$y_k^r = f\left(\boldsymbol{\theta}_k^{rT} \boldsymbol{y}^{r-1}\right), \quad k = 1, 2, \ldots, k_r,$$

where $\boldsymbol{y}^{r-1}$ is the (extended) vector comprising all the outputs at the previous layer, $r - 1$, and $f$
denotes the nonlinearity. Based on the above, the output of the $j$th neuron at the next layer is given by

$$y_j^{r+1} = f\left(\boldsymbol{\theta}_j^{r+1\,T} \boldsymbol{y}^{r}\right) = f\left(\boldsymbol{\theta}_j^{r+1\,T} \begin{bmatrix} 1 \\ f\left(\Theta^r \boldsymbol{y}^{r-1}\right) \end{bmatrix}\right),$$

where $\Theta^r := [\boldsymbol{\theta}_1^r, \boldsymbol{\theta}_2^r, \ldots, \boldsymbol{\theta}_{k_r}^r]^T$ denotes the matrix having as rows the weight vectors at layer $r$, and
$f$ acts on each element of its vector argument. One
can easily spot what we called before “a function of a function.” Obviously, this goes on as we
move on in the hierarchy. This function-over-function-over-function structure is the by-product of the
multilayer nature of the neural networks, and it is a highly nonlinear operation that gives rise to the
complication for computing the gradients, in contrast to other learners that we have studied in previous
chapters.

However, one can easily spot that computing the gradients with respect to the parameters defining
the output layer does not pose any difficulties. Indeed, the output of the $j$th neuron of the last layer
(which is actually the respective current output estimate) is written as

$$\hat{y}_j = f\left(\boldsymbol{\theta}_j^{LT} \boldsymbol{y}^{L-1}\right).$$

Since $\boldsymbol{y}^{L-1}$ is known, after the computations during the forward pass, taking the derivative with respect
to $\boldsymbol{\theta}_j^L$ is straightforward; no function-over-function operation is involved here. This is why we start from
the top layer and then move backwards.

Due to its historical importance, the full derivation of the backpropagation algorithm will be given.
Those of the readers who are not interested in the details can bypass this part in a first reading.

For the detailed derivation of the backpropagation algorithm, the squared error loss function is
adopted as an example, i.e.,

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} J_n(\boldsymbol{\theta}), \tag{18.12}$$

and

$$J_n(\boldsymbol{\theta}) = \frac{1}{2} \sum_{k=1}^{k_L} \left(\hat{y}_{nk} - y_{nk}\right)^2, \tag{18.13}$$

where $\hat{y}_{nk}$, $k = 1, 2, \ldots, k_L$, are the estimates provided at the corresponding output nodes of the
network. We will consider them as the elements of a corresponding vector, $\hat{\boldsymbol{y}}_n$.

Computation of the gradients: Let $z_{nj}^r$ denote the output of the linear combiner of the $j$th neuron in
the $r$th layer at time instant $n$, when the pattern $\boldsymbol{x}_n$ is applied at the input nodes (see Fig. 18.11). Then
we can write

$$z_{nj}^r = \sum_{m=1}^{k_{r-1}} \theta_{jm}^r y_{nm}^{r-1} + \theta_{j0}^r = \sum_{m=0}^{k_{r-1}} \theta_{jm}^r y_{nm}^{r-1} = \boldsymbol{\theta}_j^{rT} \boldsymbol{y}_n^{r-1}, \tag{18.14}$$

where by definition

$$\boldsymbol{y}_n^{r-1} := [1, y_{n1}^{r-1}, \ldots, y_{nk_{r-1}}^{r-1}]^T, \tag{18.15}$$

and $y_{n0}^r \equiv 1$, $\forall r, n$, and $\boldsymbol{\theta}_j^r$ has been defined in Eq. (18.9). For the neurons at the output layer, $r = L$,
$y_{nm}^L = \hat{y}_{nm}$, $m = 1, 2, \ldots, k_L$, and for $r = 1$, we have $y_{nm}^0 = x_{nm}$, $m = 1, 2, \ldots, k_0$; that is, $y_{nm}^0$ are set
equal to the input feature values.

Hence, we can now write

$$\frac{\partial J_n}{\partial \boldsymbol{\theta}_j^r} = \frac{\partial J_n}{\partial z_{nj}^r} \frac{\partial z_{nj}^r}{\partial \boldsymbol{\theta}_j^r} = \frac{\partial J_n}{\partial z_{nj}^r} \boldsymbol{y}_n^{r-1}. \tag{18.16}$$

Let us now define

$$\delta_{nj}^r := \frac{\partial J_n}{\partial z_{nj}^r}. \tag{18.17}$$

Then Eq. (18.11) becomes

$$\Delta\boldsymbol{\theta}_j^r = -\mu \sum_{n=1}^{N} \delta_{nj}^r \boldsymbol{y}_n^{r-1}, \quad r = 1, 2, \ldots, L. \tag{18.18}$$

Computation of $\delta_{nj}^r$: Here is where the heart of the backpropagation algorithm beats. For the
computation of the gradients, $\delta_{nj}^r$, one starts at the last layer, $r = L$, and proceeds backwards toward $r = 1$;
this “philosophy” justifies the name given to the algorithm.

1. $r = L$: We have

$$\delta_{nj}^L = \frac{\partial J_n}{\partial z_{nj}^L}. \tag{18.19}$$

For the squared error loss function,

$$J_n = \frac{1}{2} \sum_{k=1}^{k_L} \left(f(z_{nk}^L) - y_{nk}\right)^2. \tag{18.20}$$

Hence,

$$\delta_{nj}^L = (\hat{y}_{nj} - y_{nj}) f'(z_{nj}^L) = e_{nj} f'(z_{nj}^L), \quad j = 1, 2, \ldots, k_L, \tag{18.21}$$

where $f'$ denotes the derivative of $f$ and $e_{nj}$ is the error associated with the $j$th output variable at
time $n$. Note that for the last layer, the computation of the gradient, $\delta_{nj}^L$, is straightforward.

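2. $r < L$: Due to the successive dependence between the layers, the value of $z_{nj}^{r-1}$ influences all the
values $z_{nk}^r$, $k = 1, 2, \ldots, k_r$, of the next layer. Employing the chain rule for differentiation, we get

$$\delta_{nj}^{r-1} = \frac{\partial J_n}{\partial z_{nj}^{r-1}} = \sum_{k=1}^{k_r} \frac{\partial J_n}{\partial z_{nk}^r} \frac{\partial z_{nk}^r}{\partial z_{nj}^{r-1}}, \tag{18.22}$$

or

$$\delta_{nj}^{r-1} = \sum_{k=1}^{k_r} \delta_{nk}^r \frac{\partial z_{nk}^r}{\partial z_{nj}^{r-1}}. \tag{18.23}$$

However,

$$\frac{\partial z_{nk}^r}{\partial z_{nj}^{r-1}} = \frac{\partial \left(\sum_{m=0}^{k_{r-1}} \theta_{km}^r y_{nm}^{r-1}\right)}{\partial z_{nj}^{r-1}}, \tag{18.24}$$

where

$$y_{nm}^{r-1} = f(z_{nm}^{r-1}), \tag{18.25}$$

which leads to

$$\frac{\partial z_{nk}^r}{\partial z_{nj}^{r-1}} = \theta_{kj}^r f'(z_{nj}^{r-1}), \tag{18.26}$$

and combining with Eqs. (18.22) and (18.23), we obtain the recursive rule

$$\delta_{nj}^{r-1} = \left(\sum_{k=1}^{k_r} \delta_{nk}^r \theta_{kj}^r\right) f'(z_{nj}^{r-1}), \quad j = 1, 2, \ldots, k_{r-1}. \tag{18.27}$$

For uniformity with (18.21), define

$$e_{nj}^{r-1} := \sum_{k=1}^{k_r} \delta_{nk}^r \theta_{kj}^r, \tag{18.28}$$

and we finally get

$$\delta_{nj}^{r-1} = e_{nj}^{r-1} f'(z_{nj}^{r-1}). \tag{18.29}$$

The only remaining computation is the derivative of $f$. For the case of the logistic sigmoid function,
it is easily shown to be equal to (Problem 18.2)

$$f'(z) = a f(z)\left(1 - f(z)\right). \tag{18.30}$$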
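
As a quick numerical sanity check, the closed form $f'(z) = a f(z)(1 - f(z))$ for the logistic sigmoid of Eq. (18.7) can be compared against a central finite difference. The helper names below and the test points are arbitrary choices for this sketch.

```python
import math

def logistic(z, a=1.0):
    """Logistic sigmoid of Eq. (18.7)."""
    return 1.0 / (1.0 + math.exp(-a * z))

def logistic_derivative(z, a=1.0):
    """Closed-form derivative: a * f(z) * (1 - f(z))."""
    f = logistic(z, a)
    return a * f * (1.0 - f)

def central_difference(func, z, h=1e-6):
    """Numerical derivative, used only to cross-check the closed form."""
    return (func(z + h) - func(z - h)) / (2.0 * h)

a = 2.0
for z in (-1.5, 0.0, 0.7):
    numeric = central_difference(lambda t: logistic(t, a), z)
    print(z, logistic_derivative(z, a), numeric)
```

The practical consequence is that $f'$ can be obtained from the neuron outputs already stored during the forward pass, with no extra evaluation of the exponential.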


The derivation has been completed and the backpropagation scheme is summarized in Algorithm 18.2.

Algorithm 18.2 (The gradient descent backpropagation algorithm).

• Initialization
  - Initialize all synaptic weights and biases randomly with small, but not very small, values.
  - Select step size $\mu$.
  - Set $y_{nj}^0 = x_{nj}$, $j = 1, 2, \ldots, k_0 := l$, $n = 1, 2, \ldots, N$.
• Repeat; Each repetition completes one epoch.
  - For $n = 1, 2, \ldots, N$, Do
    * For $r = 1, 2, \ldots, L$, Do; Forward computations.
      • For $j = 1, 2, \ldots, k_r$, Do
          Compute $z_{nj}^r$ from (18.14)
          Compute $y_{nj}^r = f(z_{nj}^r)$
      • End For
    * End For
    * For $j = 1, 2, \ldots, k_L$, Do; Backward computations (output layer).
        Compute $\delta_{nj}^L$ from (18.21)
    * End For
    * For $r = L, L - 1, \ldots, 2$, Do; Backward computations (hidden layers).
      • For $j = 1, 2, \ldots, k_{r-1}$, Do
          Compute $\delta_{nj}^{r-1}$ from (18.29)
      • End For
    * End For
  - End For
  - For $r = 1, 2, \ldots, L$, Do; Update the weights.
    * For $j = 1, 2, \ldots, k_r$, Do
        Compute $\Delta\boldsymbol{\theta}_j^r$ from (18.18)
        $\boldsymbol{\theta}_j^r = \boldsymbol{\theta}_j^r + \Delta\boldsymbol{\theta}_j^r$
    * End For
  - End For
• Until a stopping criterion is met.
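
The steps of Algorithm 18.2 can be sketched in vectorized form with NumPy, processing all $N$ patterns at once rather than in an explicit loop over $n$. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the squared error loss and logistic neurons in every layer, the function names are invented here, the XOR data, layer sizes, step size, and initialization range are arbitrary, and `weights[r]` stores the matrix of layer $r + 1$ with the biases in column 0.

```python
import numpy as np

def sigmoid(z, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * z))

def forward(weights, X, a=1.0):
    """Forward pass; returns the extended outputs [1, y^r] of every layer."""
    N = X.shape[0]
    outputs = [np.hstack([np.ones((N, 1)), X])]          # extended y^0 = [1, x]
    for Theta in weights:
        z = outputs[-1] @ Theta.T                        # linear combiner, Eq. (18.14)
        y = sigmoid(z, a)
        outputs.append(np.hstack([np.ones((N, 1)), y]))  # prepend the constant 1
    return outputs

def cost(weights, X, Y, a=1.0):
    """Squared error cost summed over all patterns, Eqs. (18.12)-(18.13)."""
    y_hat = forward(weights, X, a)[-1][:, 1:]
    return 0.5 * np.sum((y_hat - Y) ** 2)

def gradients(weights, X, Y, a=1.0):
    """Backward pass: gradient of the cost with respect to each weight matrix."""
    outputs = forward(weights, X, a)
    grads = [None] * len(weights)
    y_L = outputs[-1][:, 1:]
    delta = (y_L - Y) * a * y_L * (1.0 - y_L)            # output layer, Eq. (18.21)
    for r in range(len(weights) - 1, -1, -1):
        grads[r] = delta.T @ outputs[r]                  # sum over n of delta * y^{r-1}
        if r > 0:
            y_prev = outputs[r][:, 1:]
            e = delta @ weights[r][:, 1:]                # Eq. (18.28); bias column excluded
            delta = e * a * y_prev * (1.0 - y_prev)      # Eq. (18.29) with Eq. (18.30)
    return grads

def train(weights, X, Y, mu=0.1, epochs=2000, a=1.0):
    """Batch mode: one weight update per epoch, as in Eq. (18.18)."""
    for _ in range(epochs):
        grads = gradients(weights, X, Y, a)
        weights = [Th - mu * g for Th, g in zip(weights, grads)]
    return weights

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])               # XOR targets (illustrative data)
sizes = [2, 3, 1]                                        # k_0, k_1, k_2
init = [rng.uniform(-1.0, 1.0, (sizes[r + 1], sizes[r] + 1)) for r in range(2)]
trained = train(init, X, Y)
print(cost(init, X, Y), cost(trained, X, Y))
```

Comparing `gradients` against finite differences of `cost` is a useful habit when implementing the backward recursion by hand, since indexing mistakes in the bias column are easy to make.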

The backpropagation algorithm can claim a number of fathers. The popularization of the algorithm 
is associated with the classical paper [199], where the derivation of the algorithm is provided. However, 
the algorithm had been derived much earlier in [250]. The idea of backpropagation also appears in [26] 
in the context of optimal control. 

Remarks 18.3. 

• A number of criteria have been suggested for terminating the backpropagation algorithm. One
possibility is to track the value of the cost function, and stop the algorithm when this gets smaller
than a preselected threshold. An alternative path is to check for the gradient values and stop when
these become small; this means that the values of the weights do not change much from iteration to
iteration (see, for example, [124]).

• As is the case with all gradient descent schemes, the choice of the step size, $\mu$, is very critical; it has
to be small enough to guarantee convergence, but not too small; otherwise the convergence speed
slows down. The choice depends a lot on the specific problem at hand. Adaptive values of $\mu$, whose
value depends on the iteration, are also popular alternatives and they will be discussed soon.

• Due to the highly nonlinear nature of the neural network problem, the cost function in the parameter
space has, in general, a complicated landscape and there are a number of local minima, where the
algorithm can be trapped. If such a local minimum is deep enough, the associated solution can
be acceptable. However, this may not be the case and the algorithm can be trapped in a shallow
minimum resulting in a bad solution. Ideally, one should randomly reinitialize the weights a number
of times and keep the best solution. Initialization has to be performed with care; we discuss this
issue later on. Local and global minima in the context of deep networks are
discussed in Section 18.11.1.

Pattern-by-Pattern/Online Scheme 

The scheme discussed in Algorithm 18.2 is of the batch type, where the weights are updated once per
epoch, that is, after all $N$ training samples have been presented to the algorithm. An alternative route is
the pattern-by-pattern version; for this case, the weights are updated every time a new training sample
appears in the input. Sometimes, such schemes are known as online. Yet, we will avoid using this
term, because it usually refers to the case where streaming data are considered and observations are
continuously coming in, instead of having a fixed-size training set where data points are considered
repeatedly, epoch after epoch. Online algorithms have been considered in Chapters 5 and 8. Another
name that is more recently used to describe such algorithms is stochastic optimization. This is
reminiscent of the stochastic approximation method, introduced in Chapter 5, where in order to minimize
the expected loss, individual observations are sequentially used to update the estimates of the unknown
parameters/weights. Of course, keep in mind that, in the true stochastic gradient descent, there is no
data reuse (i.e., epoch after epoch); data are assumed to arrive in a streaming way and convergence
is achieved asymptotically.

Pattern-by-pattern versions exploit the training set better when redundancies in the data are present
or training samples are very similar. Averaging, as is done in the batch mode, wastes resources,
because averaging the contribution to the gradient of similar patterns does not add much information. In
contrast, in the pattern-by-pattern implementations, all examples are equally exploited, inasmuch as an
update takes place for each one of the training samples.

The pattern-by-pattern mode leads to a less smooth convergence trajectory; however, such
randomness may have the advantage of helping the algorithm to escape from local minima.

Minibatch Schemes 

An intermediate way, where the parameter updates are performed every $K < N$ samples, has also been
considered, which is referred to as a minibatch or stochastic minibatch. However, increasingly, and
in a rather loose use of terminology, such schemes are referred to simply as stochastic. The choice of
the minibatch size, $K$, is influenced by a number of factors and it depends on the application and the
specific data as well as the available computing power.

For example, multicore architectures are underutilized if the batch size is very small. Also, when
GPUs are used, employing minibatch sizes that are powers of 2 can offer better run times. For such cases,
some typical values range from 16 to 256, depending on the size of the training set. On the other
hand, the use of small minibatch sizes can have a regularizing effect, due to the noise that is implicitly
added in the computation of the gradients. However, in such cases, one has to use smaller values for
the step size, for stability reasons, and this can increase the overall training time.

Batch and minibatch schemes have an averaging effect on the computation of the gradients. In
[213], it is advised to add a small white noise sequence to the training data, which may help
the algorithm to escape from poor local minima.

To exploit randomness even further in the pattern-by-pattern/minibatch schemes, it is advisable that
prior to the pass of a new epoch, the sequence in which data are presented to the algorithm is
randomized (see, for example, [79]). This has no meaning in the batch mode, because updates take place once
all data have been considered. This random shuffling of the data “breaks” possible correlations that
underlie successive samples, so that successive gradient computations become less dependent on each
other. Such correlations may be due to specific experimental protocols that have been followed for
generating the data. Although it is advisable to shuffle the data for every epoch, this may not be
practical when large data sets are used, e.g., of the order of a billion samples. In such cases, even a
single data shuffling in the beginning is beneficial.

These days, in the context of deep learning with large data training sets, minibatch schemes seem 
to be the ones that are favored for most applications. 
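
The three modes of operation differ only in how many samples feed each update, which the following NumPy sketch makes concrete. The generator name `minibatches` and the toy arrays are hypothetical; setting `K = 1` gives the pattern-by-pattern mode and `K = N` the batch mode.

```python
import numpy as np

def minibatches(X, Y, K, rng):
    """Yield shuffled minibatches of size K that together cover one epoch."""
    N = X.shape[0]
    order = rng.permutation(N)           # fresh random presentation order for this epoch
    for start in range(0, N, K):
        idx = order[start:start + K]     # the last batch is smaller if K does not divide N
        yield X[idx], Y[idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)   # toy inputs, row i = [2i, 2i + 1]
Y = np.arange(10, dtype=float)                  # toy targets, Y[i] = i
for epoch in range(2):
    for Xb, Yb in minibatches(X, Y, K=4, rng=rng):
        print(epoch, Yb)                 # a parameter update would be computed here
```

For very large data sets, where reshuffling every epoch is impractical, the call to `rng.permutation` could be made once, outside the epoch loop.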

A major problem in all gradient descent schemes and their stochastic variants is to select the step
size or learning rate, $\mu$. As we know from Chapter 5, if its value is small the learning curves are
smooth, but convergence can be slow. In contrast, if its value is relatively large, learning curves tend to
be oscillatory but convergence can get faster, provided that the value does not become large enough for
the algorithm to diverge. In practice, time-varying step sizes are more appropriate and they are in line
with the stochastic gradient rationale. One can employ various strategies, yet these strategies are more
an engineering “art” than the result of a theoretical analysis. It is advisable for a user to first look at
what previous researchers/practitioners have done in similar situations and start with this experience.
A possible strategy is to start training with a linearly decreasing step size and then switch to a fixed
one after a number of iterations. For example, one can proceed with a rule such as

$$\mu_i = (1 - \alpha_i)\mu_0 + \alpha_i \mu_I,$$

where $\alpha_i = i/I$ for some fixed iteration $I$. After iteration $I$, the step size is fixed. The issue now is to
set the parameters $\mu_0$, $\mu_I$, and $I$. In practice, one has to monitor the learning curve for a number of
iterations first and adjust the values of the parameters accordingly as a tradeoff between how fast or
how slow the initial iterations move towards convergence and at the same time being cautious about
instabilities.
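
A linearly decreasing step size that is frozen after a fixed iteration $I$, i.e., $\mu_i = (1 - \alpha_i)\mu_0 + \alpha_i\mu_I$ with $\alpha_i = i/I$, takes only a few lines to express. The function name and the numerical values below are placeholders chosen for illustration.

```python
def step_size(i, mu_0, mu_I, I):
    """Linear interpolation from mu_0 (i = 0) to mu_I (i = I), constant afterwards."""
    if i >= I:
        return mu_I
    alpha = i / I
    return (1.0 - alpha) * mu_0 + alpha * mu_I

# Illustrative values only; in practice they are tuned by monitoring the learning curve.
for i in (0, 50, 100, 500):
    print(i, step_size(i, mu_0=0.1, mu_I=0.001, I=100))
```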

18.4.2 VARIANTS OF THE BASIC GRADIENT DESCENT SCHEME 

The basic gradient descent scheme inherits all the advantages (low computational demands per iteration 
step) and all the disadvantages (slow convergence rate) of the gradient descent algorithmic family, as it 
was first presented in this book in Chapter 5. To speed up the convergence rate, a lot of research effort 
has been invested and a large number of variants of the basic gradient descent backpropagation scheme 
have been proposed. In this section, we provide some directions, which at the time the current edition 
is being compiled seem to be the trend in practice. 

For simplicity, and having derived the backpropagation descent scheme, let us get rid of the
indices $r$ and $j$ referring to the layer and the specific neuron in the respective layer. Instead, the
parameter/weight vector $\boldsymbol{\theta}$ will be used that comprises all the parameters of the network. Also, the
corresponding estimate at the $i$th iteration of a learning algorithm will be denoted by $\boldsymbol{\theta}^{(i)}$.

Let $\mathcal{L}(\boldsymbol{y}, \boldsymbol{x}, \boldsymbol{\theta})$ be the adopted loss function. The latter measures the deviation between the true
output, $\boldsymbol{y}$, and the corresponding predicted value, $\hat{\boldsymbol{y}}$, which depends on the respective input, $\boldsymbol{x}$, as well
as the parameters that define the network, $\boldsymbol{\theta}$. The cost function, over all the training points, is given by

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \mathcal{L}(\boldsymbol{y}_n, \boldsymbol{x}_n, \boldsymbol{\theta}).$$

If batch processing is employed, then at each iteration the gradient of the above cost function, i.e.,
$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$, which involves all the training points, is used for the update of the parameters. That is, for every
epoch, only one parameter vector update is performed. On the other extreme, if a pattern-by-pattern
scheme is adopted, at each iteration the updates are estimated on the basis of the gradient of the loss
function computed at the current input-output pair, i.e., $\frac{\partial \mathcal{L}(\boldsymbol{y}_n, \boldsymbol{x}_n, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$. That is, in this case, $N$ updates for
the parameters are obtained for every epoch.

The minibatch operation needs a bit more elaboration. Let us assume that the data set is split into
$M$ minibatches, each of size $K$. Obviously, $N$ should be a multiple of $K$, that is, $N = MK$. Assuming that
the data have been randomly shuffled, then at the current epoch, the $N$ input-output data pairs will appear
in the algorithm in the following sequence:

$$\underbrace{(\boldsymbol{y}_{(1)}, \boldsymbol{x}_{(1)}), \ldots, (\boldsymbol{y}_{(K)}, \boldsymbol{x}_{(K)})}_{\text{1st minibatch}}, \ldots, \underbrace{(\boldsymbol{y}_{((M-1)K+1)}, \boldsymbol{x}_{((M-1)K+1)}), \ldots, (\boldsymbol{y}_{(N)}, \boldsymbol{x}_{(N)})}_{M\text{th minibatch}},$$

where, as we have already pointed out in Section 18.2, the notation $(i) \in \{1, 2, \ldots, N\}$ is used, due
to the data shuffling. Thus, during one epoch, $M$ updates will be performed, each one involving the
gradient of a different cost; this is because a different minibatch will be considered each time, i.e.,

$$\frac{\partial J^{[m]}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}, \quad m = 1, 2, \ldots, M, \tag{18.31}$$

$$J^{[m]}(\boldsymbol{\theta}) = \sum_{k=1}^{K} \mathcal{L}\left(\boldsymbol{y}_{((m-1)K+k)}, \boldsymbol{x}_{((m-1)K+k)}, \boldsymbol{\theta}\right). \tag{18.32}$$

In the sequel, to simplify notation, we will use a common symbol, $\frac{\partial J}{\partial \boldsymbol{\theta}}$, to denote the gradient operation
for all three possible cases discussed before. The true value of the function in the place of $J$ will depend
on (a) the specific scheme (i.e., batch, minibatch, pattern-by-pattern) and (b) the current iteration step,
as different samples are involved in each step (e.g., in the stochastic versions).


Gradient Descent With a Momentum Term 

One way to improve the convergence rate, while remaining within the gradient descent rationale, is to
employ the so-called momentum term [72,247]. The correction term as well as the update recursion are
now modified as

$$\Delta\boldsymbol{\theta}^{(i)} = a\Delta\boldsymbol{\theta}^{(i-1)} - \mu \left.\frac{\partial J}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}^{(i-1)}}, \qquad \boldsymbol{\theta}^{(i)} = \boldsymbol{\theta}^{(i-1)} + \Delta\boldsymbol{\theta}^{(i)}, \tag{18.33}$$

starting from an initial value for $\Delta\boldsymbol{\theta}^{(0)}$, e.g., $\boldsymbol{0}$. The gradient is computed via backpropagation and $a$
is the so-called momentum factor. In other words, the algorithm takes into account the correction used
in the previous iteration step as well as the current gradient computations. Its effect is to increase the
step size in regions where the cost function exhibits low curvature. The step size implicitly increases
when a number of successive gradients point in exactly the same direction. Assuming that the gradient
is approximately constant over, say, $I$ successive iterations, it can be shown (Problem 18.3) that using
the momentum term, the corresponding correction becomes

$$\Delta\boldsymbol{\theta}^{(I)} \approx -\frac{\mu}{1 - a}\,\boldsymbol{g}, \tag{18.34}$$

where $\boldsymbol{g}$ is the gradient value over the $I$ successive iteration steps. That is, the step size has been
increased by the factor $\frac{1}{1-a}$. Typical values of $a$ are in the range of 0.5 to 0.9. In essence, the use of a
momentum term helps to dampen the zig-zags of the convergence trajectory, as discussed in Chapter 5
(see Fig. 5.9). It has been reported that the use of a momentum term can speed up the convergence rate
by a factor of up to two [214]. Experience seems to suggest that the use of a momentum factor helps
the batch mode of operation more than its stochastic counterparts.
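
The amplification of the step size by roughly $1/(1 - a)$ under a constant gradient can be checked numerically by iterating the correction recursion $\Delta\theta^{(i)} = a\Delta\theta^{(i-1)} - \mu g$ from a zero initial correction. The scalar setting, the function name, and the values of $\mu$, $a$, and $g$ are assumptions made for this sketch.

```python
def momentum_corrections(g, mu, a, n_iter):
    """Corrections of the momentum recursion for a constant scalar gradient g."""
    delta = 0.0                          # initial correction set to zero
    history = []
    for _ in range(n_iter):
        delta = a * delta - mu * g       # previous correction plus the gradient term
        history.append(delta)
    return history

mu, a, g = 0.01, 0.9, 2.0
corrections = momentum_corrections(g, mu, a, n_iter=200)
# First correction, correction after many steps, and the predicted limit -mu*g/(1-a).
print(corrections[0], corrections[-1], -mu * g / (1.0 - a))
```

With $a = 0.9$ the correction settles at about ten times the plain gradient descent step, in line with the factor $1/(1 - a)$.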

Iteration dependent step size: A heuristic variant of the previous version results if the step size is
left to vary as iterations progress. A rule is to change its value according to whether the cost function in
the current iteration step is larger or smaller than the previous one. Let $J^{(i)}$ be the computed cost value
at the current iteration. Then if $J^{(i)} < J^{(i-1)}$, the learning rate is increased by a factor of $r_i$. If, on the
other hand, the new value is larger than the previous one by a factor larger than $c$, then the learning
rate is reduced by a factor of $r_d$. Otherwise, the learning rate remains unaltered. Typical values for the
involved parameters are $r_i = 1.05$, $r_d = 0.7$, and $c = 1.04$. For iteration steps where the value of the
cost increases, it is advisable to set the momentum factor, $a$, equal to zero [240].

Such techniques, also known as adaptive momentum, are more appropriate for batch processing,
because for online versions the values of the cost tend to oscillate from iteration to iteration.
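
The adaptive rule described above can be written as a small helper that returns the step size and momentum factor to use in the next iteration. The sketch assumes positive cost values (so that comparing against $c$ times the previous cost is meaningful), and its name and signature are invented for illustration.

```python
def adapt_step(mu, a, cost_new, cost_old, r_i=1.05, r_d=0.7, c=1.04):
    """Return (step size, momentum factor) for the next iteration."""
    if cost_new < cost_old:
        return mu * r_i, a               # cost decreased: enlarge the step
    if cost_new > c * cost_old:
        return mu * r_d, 0.0             # cost grew by more than the factor c: shrink the step
    if cost_new > cost_old:
        return mu, 0.0                   # mild increase: keep the step, switch momentum off
    return mu, a                         # cost unchanged

print(adapt_step(0.1, 0.9, cost_new=1.0, cost_old=2.0))
print(adapt_step(0.1, 0.9, cost_new=2.2, cost_old=2.0))
print(adapt_step(0.1, 0.9, cost_new=2.05, cost_old=2.0))
```

Setting the momentum factor to zero whenever the cost increases follows the advice attributed to [240] in the text.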

Using a different step size for each weight: To improve the convergence rate, it is beneficial to
employ a different step size for each individual weight; this gives the algorithm the freedom to
better exploit the dependence of the cost function on each direction in the parameter space. In [109],
it is suggested to increase the learning rate associated with a weight if the respective gradient value
has the same sign for two successive iterations. Conversely, the learning rate is decreased if the sign
changes, because this is indicative of possible oscillation.
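
The sign-based idea can be illustrated with a vector of per-weight step sizes that is rescaled component by component. This is only a schematic version of the rationale described above, not the specific rule of [109]; the multiplicative factors `up` and `down` are hypothetical values.

```python
import numpy as np

def update_step_sizes(mu, grad, prev_grad, up=1.1, down=0.5):
    """Grow the step where the gradient sign repeats, shrink it where the sign flips."""
    sign_product = grad * prev_grad
    mu = np.where(sign_product > 0, mu * up, mu)     # same sign twice in a row
    mu = np.where(sign_product < 0, mu * down, mu)   # sign change: possible oscillation
    return mu

mu = np.array([0.1, 0.1, 0.1])
grad = np.array([1.0, -1.0, 0.0])
prev_grad = np.array([2.0, 3.0, 5.0])
print(update_step_sizes(mu, grad, prev_grad))
```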

In the sequel, more recent schemes will be discussed, which use step sizes that are iteration
dependent as well as different for each weight.

Example 18.1. In this toy example, the capability of a multilayer perceptron to classify nonlinearly
separable classes is demonstrated in the two-dimensional space, where visualization is possible. The
classification task consists of two classes, each being modeled by a four-component Gaussian mixture
model, where the covariance matrix for each component is $\sigma^2 I$, with $\sigma^2 = 0.08$. The mean values
are different for each of the Gaussians. Specifically, the samples of the class denoted by a red o (see
Fig. 18.12) are spread around the mean vectors

$$[0.4, 0.9]^T, \quad [2.0, 1.8]^T, \quad [2.3, 2.3]^T, \quad [2.6, 1.8]^T,$$

and those of the class denoted by a black + around the values

$$[1.5, 1.0]^T, \quad [1.9, 1.0]^T, \quad [1.5, 3.0]^T, \quad [3.3, 2.6]^T.$$

A total of 400 training vectors were generated, 50 from each distribution. A multilayer perceptron with
three neurons in the first and two neurons in the second hidden layer was used, with a single output
neuron. The activation function was the logistic one with $a = 1$ and the desired outputs 1 and 0,
respectively, for the two classes. Two different algorithms were used for the training, namely, the momentum
and the adaptive momentum. After some experimentation, the parameters employed were (a) for the
momentum, $\mu = 0.05$, $a = 0.85$ and (b) for the adaptive momentum, $\mu = 0.01$, $a = 0.85$, $r_i = 1.05$,
$c = 1.05$, $r_d = 0.7$. The weights were initialized by a uniform pseudorandom distribution between 0
and 1. Fig. 18.12A shows the respective output error convergence curves for the two algorithms as a
function of the number of epochs. The respective curves can be considered typical and the adaptive
momentum algorithm leads to faster convergence. Both curves correspond to the batch mode of operation.
Fig. 18.12B shows the graph of the decision line of the resulting classifier, using the parameters/weights
that are estimated via the adaptive momentum training.

FIGURE 18.12

(A) Error convergence curves for the adaptive momentum (red line) and the momentum algorithms, for
Example 18.1. Note that the adaptive momentum leads to faster convergence. (B) The decision line implemented by the
classifier that results from the adaptive momentum algorithm.

Nesterov’s Momentum Algorithm 

A variant of the celebrated Nesterov algorithm [169] was suggested in [223]. The update rule for the
correction term becomes

$$\Delta\boldsymbol{\theta}^{(i)} = a\Delta\boldsymbol{\theta}^{(i-1)} - \mu \left.\frac{\partial J\left(\boldsymbol{\theta} + a\Delta\boldsymbol{\theta}^{(i-1)}\right)}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}^{(i-1)}}.$$

The difference with the previous momentum update rule lies in that the gradient of the cost function
is computed after pushing $\boldsymbol{\theta}$ in the direction of the current correction term, $\Delta\boldsymbol{\theta}^{(i-1)}$. In words, the
gradient is not computed at the currently available estimates of the parameters but at their approximate
future values. As has been reported in [223], this can lead to substantial gains in convergence speed.
The values for $a$ are in the same range as in the case of the momentum algorithm.
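
A Nesterov-type step differs from the plain momentum step only in where the gradient is evaluated, as the following NumPy sketch shows. The quadratic cost, its matrix `A`, and the values of $\mu$ and $a$ are hypothetical and serve only to give the gradient function something to act on.

```python
import numpy as np

def nesterov_step(theta, delta, grad_fn, mu, a):
    """One Nesterov-type update: the gradient is evaluated at the look-ahead point."""
    lookahead = theta + a * delta                    # approximate future value of theta
    delta = a * delta - mu * grad_fn(lookahead)      # new correction term
    return theta + delta, delta

# Hypothetical ill-conditioned quadratic cost J = 0.5 * theta^T A theta.
A = np.diag([1.0, 25.0])

def grad_fn(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
delta = np.zeros(2)
for _ in range(200):
    theta, delta = nesterov_step(theta, delta, grad_fn, mu=0.02, a=0.9)
print(theta)
```

Replacing `lookahead` by `theta` in the gradient call recovers the standard momentum recursion, which makes the two schemes easy to compare on the same cost.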

The AdaGrad Algorithm 

The AdaGrad algorithm was introduced in its Online form in the convex cost functions setting in Chap- 
ter 8, Eq. (8.71). Borrowing the main rationale behind the algorithm, in our current framework it can 
be rephrased as 


where by dehnition 


g 


(i) 


3 J 
~d0 


0(0 


is the gradient computed at the respective estimate, 0 {, \ and the matrix G, is the of sum of the outer 
products of ali the gradient vectors up to and including iteration i, that is, 


G;:= J> ( V )T 
1=0 


In practice, matrix G/ is taken to be diagonal; hence the inverse of its root becomes computationally 
attractive. In such a case, if D is the total number of parameters of the network, the D x D matrix G, 
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involves on its diagonal the squares of the respective gradients, i.e., 

i j 

Gi(d,d) = , d= 1,2. D, 

t =o 


where gd denotes the dlh component of the respective gradient at iteration (t), i.e., g ( p — -TA, and Oj 

3d d 

is the Wth element of 0 e R w . Usually, a small constant e is used to avoid possible division by zero, 
and the corresponding updates for each one of the parameter vector components take the form 
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(18.35) 


The value of $\epsilon$ can be of the order of $10^{-8}$ or so. In words, the step size in AdaGrad is 
iteration dependent and also different for each component. The value of $\mu$ is usually set close to 0.01. 
The correction for each individual parameter is inversely proportional to the square root of the 
accumulated squares of its gradients; larger values lead to smaller step sizes. Thus, as iterations evolve, 
each dimension evens out over time. This can be beneficial when training deep networks, because the scale 
of the gradients can vary a lot in magnitude in different layers. 

However, AdaGrad’s main drawback is the accumulation of the squared gradients in the denominator, 
which can freeze the updates after a number of iterations. 
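
The per-component scaling of AdaGrad, with $G_i$ taken to be diagonal, can be sketched as follows. This is a minimal illustration under stated assumptions: the name `adagrad_step` is hypothetical, and adding $\epsilon$ to the square root (rather than under it) is one common convention chosen here.

```python
import math


def adagrad_step(theta, accum, g, mu=0.01, eps=1e-8):
    """One diagonal AdaGrad step; accum[d] plays the role of G_i(d, d)."""
    theta_new, accum_new = [], []
    for t, a, gd in zip(theta, accum, g):
        a = a + gd * gd                    # running sum of squared gradients
        step = mu / (math.sqrt(a) + eps)   # per-component, iteration-dependent
        theta_new.append(t - step * gd)
        accum_new.append(a)
    return theta_new, accum_new


# Two components whose gradients differ by four orders of magnitude.
theta, accum = adagrad_step([1.0, 1.0], [0.0, 0.0], [100.0, 0.01])
print(theta)  # both components move by roughly mu = 0.01

# With a constant gradient the effective step shrinks like mu / sqrt(i).
t, a, steps = [0.0], [0.0], []
for i in range(1, 5):
    t_next, a = adagrad_step(t, a, [1.0])
    steps.append(t[0] - t_next[0])
    t = t_next
print(steps)
```

The second loop makes the drawback visible: even though the gradient never changes, the steps keep shrinking because the accumulated sum only grows.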

The RMSProp With Nesterov Momentum 

In the RMSProp, the sum of the squares of the gradients used in AdaGrad is replaced by a recursively 
defined decaying average of all past squared gradients. In such a way, the remote past samples are 
exponentially discarded. At the same time, a Nesterov-type momentum rationale is employed. The 
main recursions are summarized as follows: 

where $\beta$ is a user-defined parameter. The first of the recursions computes the gradient after the 
Nesterov-type correction. The second one updates the squares of the gradient values in the decaying 
average rationale. The “$\circ$” operation denotes element-wise multiplication. The third recursion 
computes the correction term in the AdaGrad rationale and the fourth one provides the updates. The 
algorithm starts with initial conditions on $\boldsymbol{v}^{(0)} = \boldsymbol{0}$, 
$\Delta\boldsymbol{\theta}^{(0)}$, and $\boldsymbol{\theta}^{(0)}$. Some typical values for the 
hyperparameters are $\epsilon = 10^{-8}$, $\beta = 0.9$, and $\mu = 0.001$. 

The RMSProp was proposed by Hinton et al. in a lecture (see footnote 2). A simpler version, the plain 
RMSProp, does not use the Nesterov step and it coincides with the so-called AdaDelta rule, which was 
independently proposed in [260]. The scheme is similar to the AdaGrad with the difference that the 
decaying average for the squared gradients is employed. 
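
The four recursions described in words above can be sketched in Python as follows. The exact form is an assumption made for illustration (in particular the placement of $\epsilon$ under the square root and the way the momentum factor $\alpha$ enters the correction term), and the names `rmsprop_nesterov_step` and `toy_grad` are hypothetical.

```python
import math


def rmsprop_nesterov_step(theta, delta_prev, v_prev, grad,
                          mu=0.001, alpha=0.9, beta=0.9, eps=1e-8):
    """One step of RMSProp with a Nesterov-type look-ahead (illustrative form)."""
    # 1) Gradient after the Nesterov-type correction.
    g = grad([t + alpha * d for t, d in zip(theta, delta_prev)])
    # 2) Decaying average of the squared gradients (element-wise g o g).
    v = [beta * vp + (1.0 - beta) * gd * gd for vp, gd in zip(v_prev, g)]
    # 3) Correction term, scaled per component in the AdaGrad rationale.
    delta = [alpha * d - mu * gd / math.sqrt(vd + eps)
             for d, gd, vd in zip(delta_prev, g, v)]
    # 4) Parameter update.
    theta_new = [t + d for t, d in zip(theta, delta)]
    return theta_new, delta, v


def toy_grad(theta):
    """Gradient of the toy cost J = theta_0^2."""
    return [2.0 * theta[0]]


theta, delta, v = [1.0], [0.0], [0.0]
for _ in range(20):
    theta, delta, v = rmsprop_nesterov_step(theta, delta, v, toy_grad)
print(theta)
```

Because `v` forgets old gradients at rate `beta`, the denominator no longer grows without bound, which is what removes the freezing effect of AdaGrad.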

The Adaptive Moment Estimation Algorithm (Adam) 

The Adam algorithm proposed in [121] borrows the concept of forgetting past values of the squared 
gradients, yet it introduces and propagates the values of the gradients as well, in a similar way as the 
momentum algorithm does. Furthermore, it introduces some important normalization that takes care of 
bias that may be introduced. The major recursions around which the Adam evolves are 


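$$
\begin{aligned}
\boldsymbol{g} &= \left.\frac{\partial J}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}^{(i-1)}}, \\
\boldsymbol{m}^{(i)} &= \beta_1 \boldsymbol{m}^{(i-1)} + (1 - \beta_1)\,\boldsymbol{g}, \\
\boldsymbol{v}^{(i)} &= \beta_2 \boldsymbol{v}^{(i-1)} + (1 - \beta_2)\,\boldsymbol{g} \circ \boldsymbol{g},
\end{aligned}
$$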

where $\circ$ denotes element-wise multiplication and $\beta_1$ and $\beta_2$ are user-defined parameters. 
In the sequel, the obtained values of the moments are normalized, i.e., 

These normalizations account for the tendency of the two gradient moments to be biased towards zero, 
especially during the early iterations, due to their zero value initialization. As iterations proceed, the 
normalizing coefficients tend to one. The update for the $d$th component of the parameter vector is 
computed as 

The Adam algorithm is given in Algorithm 18.3. 


Algorithm 18.3 (The Adam algorithm). 

• Initialization 
- Initialize $\boldsymbol{\theta}^{(0)}$. 
- Initialize $\boldsymbol{v}^{(0)} = \boldsymbol{0}$, $\boldsymbol{m}^{(0)} = \boldsymbol{0}$. 
- Select step size $\mu$; typical value 0.001. 
- Select $\beta_1$ and $\beta_2$; typical values 0.9 and 0.99, respectively. 
- Select $\epsilon$; typical value $10^{-8}$. 
- Set $i = 0$. 
• While a stopping criterion not met Do 
- Select a minibatch of $K$ samples, e.g., 
  $(y_{(m-1)K+1}, \boldsymbol{x}_{(m-1)K+1}), \ldots, (y_{mK}, \boldsymbol{x}_{mK})$. 
- Compute the gradient $\boldsymbol{g} = \frac{\partial J^{(m)}}{\partial \boldsymbol{\theta}}$ (see Eq. (18.32)). 

- End Do 
• End While 

2 www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. 
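
A compact Python sketch of Adam is given next. It follows the commonly used form of the algorithm, with each moment divided by one minus the corresponding decay parameter raised to the iteration index and with $\epsilon$ added to the square root; both choices, as well as the names `adam` and `toy_grad` and the use of a full gradient in place of a minibatch gradient, are assumptions made for illustration.

```python
import math


def adam(grad, theta, mu=0.001, beta1=0.9, beta2=0.99, eps=1e-8, n_iter=100):
    """Run n_iter Adam iterations from theta and return the final estimate."""
    m = [0.0] * len(theta)   # decaying average of the gradients
    v = [0.0] * len(theta)   # decaying average of the squared gradients
    for i in range(1, n_iter + 1):
        g = grad(theta)      # in practice: the gradient over a minibatch
        m = [beta1 * md + (1.0 - beta1) * gd for md, gd in zip(m, g)]
        v = [beta2 * vd + (1.0 - beta2) * gd * gd for vd, gd in zip(v, g)]
        # Normalization: both averages start at zero and are biased low early on.
        m_hat = [md / (1.0 - beta1 ** i) for md in m]
        v_hat = [vd / (1.0 - beta2 ** i) for vd in v]
        theta = [t - mu * mh / (math.sqrt(vh) + eps)
                 for t, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def toy_grad(theta):
    """Gradient of the toy cost J = theta_0^2."""
    return [2.0 * theta[0]]


print(adam(toy_grad, [1.0], n_iter=1))    # the first step has size close to mu
print(adam(toy_grad, [1.0], n_iter=100))
```

Note that the normalizing coefficients `1 - beta1 ** i` and `1 - beta2 ** i` approach one as `i` grows, so the correction matters only during the early iterations.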


Remarks 18.4. 

• At the time this edition of the book is compiled, the most popular algorithms for training deep neural 
networks seem to be the Adam, the simple stochastic gradient with or without momentum, and the 
RMSProp with or without Nesterov’s momentum term. It all depends on the application and also 
the familiarity of the user. Furthermore, the use of minibatches seems to be the trend. 

• Besides the standard forms of the previous schemes, various techniques have been proposed for 
their more efficient running and utilization. For example, one can employ warm restarts during 
training (e.g., [141]) or use ensemble combining techniques (Section 7.9 and, e.g., [101]). The idea 
in the latter is to train a single network, converging to several local minima along its optimization 
path, and saving the model parameters. These are then combined in an ensemble rationale known 
as snapshot ensembling. 

Some Practical Hints 

Training a neural network still has a lot of practical engineering flavor compared to mathematical 
rigor. In this section, some practical hints are presented that experience has shown to be useful 
in improving the performance of the backpropagation algorithm (see, for example, [131] for a more 
detailed discussion). 


Preprocessing the input features/variables: It is advisable to preprocess the input variables so they 
have (approximately) zero mean over the training set. Also, one should scale them so they all have 
similar variances, assuming that all variables are equally important. In the case that the nonlinearity 
used is of a squashing type, it is advisable that their variance should also match the range of values 
of the activation (squashing) function. Moreover, it is beneficial for the convergence of the algorithm 
if the input variables are uncorrelated. A way to achieve this is via an appropriate transformation, for 
example, PCA. 
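
The centering and scaling step can be sketched as follows; decorrelation (e.g., via PCA) is not shown. The function name `standardize` and the toy data are assumptions made for the example, and the sketch assumes that no feature is constant over the training set.

```python
import math


def standardize(X):
    """Return the columns of X (a list of rows) scaled to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
            for j in range(d)]
    # Assumes no feature is constant over the training set (std > 0).
    Z = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]
    return Z, means, stds


X_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
Z_train, means, stds = standardize(X_train)
print(means, stds)
```

The returned `means` and `stds` are the training set statistics; the same values would be reused to transform any data presented to the network later.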

Selecting symmetric activation functions: In the cases where squashing types of activation functions 
are employed, it is desirable that the outputs of the neurons assume equally likely positive and 
negative values. After all, the outputs of one layer become inputs to the next. To this end, the hyperbolic 
tangent activation function in Eq. (18.8) can be used. Recommended values are $a = 1.7159$ and $c = 4/3$. 
These values guarantee that if the inputs are preprocessed as suggested before, that is, to be normalized 
to variances equal to one, then the variance at the output of the activation function is also equal to one 
and the respective mean value equal to zero. However, as we will soon see in Section 18.6.1, currently, 
such activation functions seem to be less popular. 

Target values: The target values should be carefully chosen to be in line with the activation function 
used. The values should be selected to offset by some small amount the limiting value of the squashing 
function. Otherwise, the algorithm tends to push the weights to large values and this slows down the 
convergence; the activation function is driven to saturation, making the derivative of the activation 
function very small, which in turn renders small gradient values. For the hyperbolic tangent function, 
using the parameters discussed before, the choice of $\pm 1$ for the target class labels seems to be the 
right one. Note that in this case, the saturation values are $\pm a = \pm 1.7159$. In Section 18.5, we will 
see that the choice of the output activation function should be dictated by the adopted optimality criterion. 

Initialization: Initialization of the weight parameters for any optimization scheme is of major 
importance when training neural networks. Starting from a “wrong” set of values can have a number of 
unwanted effects on training. Initialization can affect how fast or slow an algorithm converges. Also, 
the initial values play a critical part in whether the algorithm converges to a point of low or high cost 
(good or bad local minimum). Furthermore, the generalization performance can be affected by the 
initialization. However, it must be pointed out that the theoretical underpinnings related to such issues are 
not yet clear and a number of notions are not well understood and currently constitute ongoing research 
areas (see, also, the discussion in Section 18.11.1). 

Weights are randomly initialized to some small values. To this end, a number of different scenarios 
have been developed, and this is clear by the number of options one is given when using standard 
related software packages. 

For example, if squashing-type activation functions are used and large initial values are assigned, 
then all activation functions will operate in their saturation region. This drives the gradients to very 
small values, which, in turn, slows down convergence. The effect on the gradients is the same when 
the weights are initialized to very small values. Initialization must be done so that the operation in each 
neuron takes place in the (approximate) linear region of the graph of the activation function and not 
in the saturated one. It can be shown ([131], Problem 18.4) that if the input variables are preprocessed 
to zero mean and unit variance and the tangent hyperbolic function is used with parameter values as 
discussed before, then the best choice for initializing the weights is to assign values drawn from a 
distribution with zero mean and standard deviation equal to 

$$
\sigma = m^{-1/2},
$$

where $m$ is the number of inputs (synaptic connections) in the corresponding neuron. 

A critical issue during initialization is the so-called symmetry breaking between neurons. It can 
easily be shown that if two hidden neurons use the same activation function (which is usually the case) 
and are connected to the same input, then if they are initialized with the same values, their gradients will 
be equal and after the descent iteration their updated values will remain equal. No doubt, this is an 
undesirable effect that leads to redundancies. This is the reason that setting all values initially to zero 
is not a good practice. Random initialization helps to avoid such scenarios. On the other hand, trying 
to avoid large values is in line with trying to avoid what is known as the exploding gradient phenomenon, 
which is intrinsic to backpropagation (see Section 18.6). Thus, it is advisable to start with small random 
values. 
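
The symmetry issue can be checked numerically on a tiny network. The sketch below is an assumed toy setting (two tanh hidden neurons, a linear output node, squared error loss, no biases); the name `hidden_grads` and all numerical values are hypothetical.

```python
import math


def hidden_grads(x, y, W, v):
    """Gradients of J = 0.5 * (y - y_hat)^2 with respect to each row of W.

    The rows of W are the weight vectors of the tanh hidden neurons and v
    holds the weights of a linear output node (biases omitted for brevity).
    """
    h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    y_hat = sum(vj * hj for vj, hj in zip(v, h))
    err = y_hat - y
    # dJ/dw_j = err * v_j * (1 - h_j^2) * x
    return [[err * vj * (1.0 - hj * hj) * xi for xi in x]
            for vj, hj in zip(v, h)]


x, y = [0.5, -1.0], 1.0
same = hidden_grads(x, y, W=[[0.3, 0.3], [0.3, 0.3]], v=[0.2, 0.2])
rand = hidden_grads(x, y, W=[[0.3, -0.1], [0.05, 0.4]], v=[0.2, 0.2])
print(same[0] == same[1])   # identical neurons receive identical gradients
print(rand[0] == rand[1])   # different initial values break the symmetry
```

Since identical neurons receive identical gradients, a gradient descent step keeps them identical, so the two neurons never learn different features.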

To implement the above, possible scenarios are the uniform or the Gaussian distributions. For 
example, in [58], it is suggested to initialize the weights via the uniform distribution 

$$
\mathcal{U}\left(-\sqrt{\frac{6}{m_{in} + m_{out}}},\ \sqrt{\frac{6}{m_{in} + m_{out}}}\right),
$$

where $m_{in}$ and $m_{out}$ are the number of inputs and outputs at the respective layer $r$. The goal 
behind this heuristic is to initialize all layers to have the same activation variance and the same 
gradient variance. However, the derivation is based on the assumption of linear units. A number of 
variants of the above are also possible, e.g., with the draw from 
$\mathcal{U}\left(-\frac{1}{\sqrt{m_{in}}}, \frac{1}{\sqrt{m_{in}}}\right)$. Also, instead of a uniform, the 
zero mean Gaussian of the same variance can be used or sometimes the truncated Gaussian. 

In contrast to the weights, biases can be set to zero, and this seems to be the most often used 
scenario. 

However, all the recipes should always be treated with some care and before using them it is a good 
idea to see what other researchers have used before in similar situations. Furthermore, the type of the 
adopted nonlinearity should also be considered (see, e.g., [77,113]). In [77], the weights are initialized 
by taking into account the size of the previous layer only. More specifically, it is suggested to draw 
samples from a zero mean (truncated) Gaussian with variance $2/m_{in}$. It is reported that the latter is 
more appropriate when rectified linear units (Section 18.6.1) are employed. For hyperbolic tangent 
nonlinearities, the uniform distributions mentioned before seem to be a more popular choice. 
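
The two recipes can be sketched in Python as follows. The function names are hypothetical, the uniform limit $\sqrt{6/(m_{in}+m_{out})}$ is the value commonly associated with the heuristic of [58] and is assumed here, and a plain (not truncated) Gaussian is used for simplicity.

```python
import math
import random


def uniform_init(m_in, m_out, rng):
    """m_out x m_in weights from U(-limit, limit), limit = sqrt(6 / (m_in + m_out))."""
    limit = math.sqrt(6.0 / (m_in + m_out))
    return [[rng.uniform(-limit, limit) for _ in range(m_in)] for _ in range(m_out)]


def fan_in_gaussian_init(m_in, m_out, rng):
    """m_out x m_in weights from a zero mean Gaussian with variance 2 / m_in."""
    std = math.sqrt(2.0 / m_in)
    return [[rng.gauss(0.0, std) for _ in range(m_in)] for _ in range(m_out)]


def sample_variance(W):
    flat = [w for row in W for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)


rng = random.Random(0)            # fixed seed, only for reproducibility
W_tanh = uniform_init(300, 200, rng)
W_relu = fan_in_gaussian_init(300, 200, rng)
biases = [0.0] * 200              # biases are simply set to zero
print(sample_variance(W_tanh), 2.0 / (300 + 200))
print(sample_variance(W_relu), 2.0 / 300)
```

A uniform distribution on $(-l, l)$ has variance $l^2/3$, so the first recipe corresponds to a weight variance of $2/(m_{in}+m_{out})$, which is what the first printed pair compares.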


Batch Normalization 

A major difference while training multilayer structures compared to other single-layer models is that 
the distribution of the inputs for each layer changes during training, as the parameters of the previous 
layers are iteratively updated. This slows down the training process by requiring the use of lower 
learning rates/step sizes. Furthermore, it makes the training sensitive to the parameter initialization, 
especially in the presence of saturating nonlinearities. 

One way to address and cope with such phenomena is to employ the so-called batch normalization 
[104]. As has already been discussed, during preprocessing, it is advisable to scale the input variables 
to unit variance around a zero mean value, in order to avoid large differences in the dynamic range of 
the values of the various inputs. Batch normalization is inspired by this fact and tries to impose such a 
scaling on the activation values that are produced by the network in all the layers. There are different 
variants of the basic idea. For example, some apply the normalization prior to the nonlinearity and 
some after. We will follow the latter option to demonstrate the method. 

Referring to Fig. 18.11, let us consider the output $y_j$ of the $j$th neuron, where for notational 
simplicity the superscript indicating the respective layer, $r$, has been omitted. The activation $y_j$ 
comprises an input to all the neurons of the next layer, $r + 1$. Furthermore, assume that minibatches 
of size $K$ are used. Prior to updating the parameters to the descent direction, while the, say, $m$th 
minibatch is being considered, all the values of the respective activations are tracked and their sample 
mean and corresponding variance are computed as 



$$
\mu_j^{(m)} = \frac{1}{K}\sum_{k=1}^{K} y_j^{(k)}, \qquad
\big(\sigma_j^{(m)}\big)^2 = \frac{1}{K}\sum_{k=1}^{K} \big(y_j^{(k)} - \mu_j^{(m)}\big)^2,
$$

where $y_j^{(k)}$ is the response of the $j$th neuron when it is excited by the $k$th sample of the 
respective (e.g., $m$th) minibatch. Then the corresponding normalized activations are computed as 

$$
\tilde{y}_j^{(k)} = \frac{y_j^{(k)} - \mu_j^{(m)}}{\sigma_j^{(m)}}. \tag{18.36}
$$

In words, every activation is normalized to unit variance around a zero mean, prior to passing it to the 
layer above. 

However, batch normalization goes one step further. Instead of using the normalized variables in 
their previous primitive form, it imposes a further linear transformation, i.e., 

$$
\hat{y}_j^{(k)} = \gamma_j \tilde{y}_j^{(k)} + \beta_j, \tag{18.37}
$$

where $\gamma_j$ and $\beta_j$ are learned during the training as extra parameters in the backpropagation 
framework. In such a way, we make up for the equal treatment that normalization implicitly applies to each 
neuron, and we offer the freedom to the network to adjust the expressive power of each neuron separately. 

It turns out that batch normalization allows for higher learning rates and speeds up convergence. 
Furthermore, it is reported to be less sensitive to initialization and can, also, act as a regularizer [104]. 
Further theoretical findings related to batch normalization can be found in, e.g., [123]. 

Once the network has been trained, one wonders how to employ the transformation given in 
Eq. (18.37), since Eq. (18.36) requires the mean and variance of a specific minibatch. In practice, once 
the parameters $\gamma$ and $\beta$ have been learned, the mean and standard deviation are replaced by 
averages over the various minibatches, i.e., 

and 

$$
\hat{y}_j = \gamma_j \frac{y_j - \mu_j}{\sigma_j + \epsilon} + \beta_j,
$$

where a small constant $\epsilon$ has been used to avoid a possible division by zero and $M$ is the total 
number of minibatches. Often, instead of $\sigma_j^2$, its corrected unbiased version (see, e.g., 
Problem 7.5) is used, i.e., $\frac{K}{K-1}\sigma_j^2$. 

In practice, however, one uses running averages that are collected during training. For example, 
one can employ a momentum-based running average, and update the statistics after each minibatch has 
been processed, i.e., 

$$
\mu_j(\text{new}) = \alpha\,\mu_j(\text{old}) + (1 - \alpha)\,\mu_j^{(m)}, \quad m = 1, 2, \ldots, M,
$$

where $\alpha$ is a user-defined momentum parameter. A similar recursion applies to the variances. 
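
The training-time normalization of Eqs. (18.36) and (18.37), the running averages, and the inference-time transformation can be sketched for a single neuron as follows. The function names are hypothetical, and adding $\epsilon$ to the standard deviation in both phases is an assumption made here to guard against a zero variance.

```python
import math


def batchnorm_train(y_batch, gamma, beta, eps=1e-8):
    """Normalize one neuron's activations over a minibatch, then scale and shift."""
    K = len(y_batch)
    mean = sum(y_batch) / K
    var = sum((y - mean) ** 2 for y in y_batch) / K
    y_norm = [(y - mean) / (math.sqrt(var) + eps) for y in y_batch]
    out = [gamma * yn + beta for yn in y_norm]
    return out, mean, var


def update_running(old, new, alpha=0.9):
    """Momentum-based running average of a minibatch statistic."""
    return alpha * old + (1.0 - alpha) * new


def batchnorm_infer(y, run_mean, run_var, gamma, beta, eps=1e-8):
    """Apply the learned transformation using the collected statistics."""
    return gamma * (y - run_mean) / (math.sqrt(run_var) + eps) + beta


out, mean, var = batchnorm_train([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
run_mean = update_running(0.0, mean)
run_var = update_running(1.0, var)
print(out, mean, var)
print(batchnorm_infer(2.5, mean, var, gamma=2.0, beta=3.0))
```

After the transformation, the minibatch outputs have mean `beta` and standard deviation `gamma` (up to the effect of `eps`), which is the freedom the learned parameters give back to each neuron.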


18.4.3 BEYOND THE GRADIENT DESCENT RATIONALE 

The other path to follow to improve upon the convergence rate of the gradient descent-based 
backpropagation algorithm, at the expense of increased complexity, is to resort to schemes that involve, 
in one way or another, information related to the second-order derivatives. We have already discussed such 
families in this book, for example, the Newton family introduced in Chapter 6. For each one of the 
available families, a backpropagation version can be derived to serve the needs of the neural network 
training. We will not delve into details, because the concept remains the same as that discussed for 
the gradient descent. The difference is that now second-order derivatives have to be propagated 
backwards. The interested reader can look at the respective references and also in [19,33,75,131,263] for 
more details. 

In [11,112,124], schemes based on the conjugate gradient philosophy have been developed, and 
members of the Newton family have been proposed in, for example, [13,191,244]. In all these schemes, 
the computation of the elements of the Hessian matrix, that is, 

$$
\frac{\partial^2 J}{\partial \theta_{jk}^{r}\,\partial \theta_{j'k'}^{r'}},
$$

is required, where $j$ and $k$ run over all the parameters in the $r$th layer and $j'$ and $k'$ over all the 
parameters associated with the $r'$th layer, for all values of $r$ and $r'$. To this end, various simplifying 
assumptions are employed in the different papers (see, also, Problems 18.5 and 18.6). 

An algorithm which is loosely based on Newton’s scheme has been proposed in [50], known as the 
quickprop algorithm. It is a heuristic method that treats the synaptic weights as if they were 
quasi-independent. It then approximates the error surface, as a function of each weight, via a quadratic 
polynomial. If this has its minimum at a sensible value, it is used as the updated value in the iterations; 
otherwise, a number of heuristics are mobilized. A common formulation for the resulting updating rule 
is given by 

(18.39) 

with typical values of the parameters used being $0.01 < \mu < 0.6$ and $\alpha_{\max} \approx 1.75$. An 
algorithm in similar spirit with the quickprop has been proposed in [192]. 

In practice, when large networks and data sets are involved, simpler methods, such as carefully 
tuned gradient descent schemes and their versions, as for example those previously discussed in 
Section 18.4.2, seem to work better and are currently the trend. The more complex second-order techniques 
can offer improvements in smaller networks, especially in the context of regression tasks. 


18.5 SELECTING A COST FUNCTION 

As we have already commented, feed-forward neural networks belong to the more general class of 
parametric models; thus, in principle, any loss function we have met so far in this book can be 
employed. Over the years, certain loss functions have gained in popularity in the context of regression and 
classification tasks. However, in the context of feed-forward multilayer networks, the choice of the loss 
function is tightly coupled to the type of the output nonlinearity that is used. A “wrong” combination 
can severely affect the speed at which the network learns during training. To understand this claim, let 
us consider the following combination. 

Adopt as loss function the squared error one and as output nonlinearity the logistic sigmoid 
function, $\sigma(z)$, given in Eq. (18.7). For the sake of simplicity, let us assume that we only have a 
single output node and also suppress the indices $n$, which relates to observations, and $r$, which relates 
to layers. Then, if $y$ is the target value and $\hat{y}$ the estimated one, the contribution to the cost 
function will be 

$$
J = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}\big(y - \sigma(z)\big)^2, \quad z = \boldsymbol{\theta}^T \boldsymbol{y}, \tag{18.40}
$$

where $\boldsymbol{y}$ is the (extended) vector of the outputs of the last hidden layer and 
$\boldsymbol{\theta}$ is the vector of synaptic weights connecting the nodes of the last hidden layer to the 
output one. The bias term has been absorbed in $\boldsymbol{\theta}$, according to standard practice. Then 
the gradient of $J$ with respect to $\boldsymbol{\theta}$, using the chain rule for derivatives, is trivially 
seen to be 

$$
\frac{\partial J}{\partial \boldsymbol{\theta}} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial \boldsymbol{\theta}} = \boldsymbol{y}\,\frac{\partial J}{\partial z},
$$

where, after taking into account Eq. (18.40), we get 

$$
\frac{\partial J}{\partial z} = \frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z} = (\hat{y} - y)\,\sigma'(z).
$$

Thus, 

$$
\frac{\partial J}{\partial \boldsymbol{\theta}} = (\hat{y} - y)\,\sigma'(z)\,\boldsymbol{y}.
$$

In words, the above means that the gradient of the cost, with respect to the parameters leading to the 
output node, depends on the error, $(\hat{y} - y)$, and also on the derivative of the logistic sigmoid 
function. The dependence on the error is “healthy.” Indeed, this is what we want. The larger the error, the 
larger the gradient (in absolute value); this will make the correction term in Eq. (18.10) large to account 
for a large error or keep the update close to the current estimate when the error is small. However, 
the dependence on the derivative of the nonlinear function is not good news. Looking at the graph in 
Fig. 18.8, for values of the argument that are not close to zero, the function quickly saturates and its 
derivative becomes very small. Hence, although the error can be large, the respective correction will be 
small. Also, recall that in the backpropagation, the gradient of the cost with respect to the parameters 
of the last (output) layer, say, $L$, is passed to the previous one, layer $L - 1$, and is used to compute the 
respective gradients, and so on. If the values of the gradients of the last layer are small, this affects all 
the gradients in all layers and as a result convergence can significantly slow down. 

In contrast, one can readily check that if the output node is a linear one, instead of being sigmoid, 
then the combination of the squared error loss function with a linear output unit is a good combination. 
The resulting gradient becomes proportional to the error, since now the derivative becomes a constant. 
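
The effect of saturation on the squared error loss of Eq. (18.40) can be seen numerically. The sketch below compares the derivative of the cost with respect to $z$ for a sigmoid and for a linear output node; the function names and the chosen values of $z$ are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dJ_dz_sigmoid(y, z):
    """dJ/dz for J = 0.5 * (y - sigmoid(z))^2: error times sigmoid derivative."""
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)


def dJ_dz_linear(y, z):
    """dJ/dz for J = 0.5 * (y - z)^2: the derivative of the output unit is one."""
    return z - y


# Target y = 1; as z becomes more negative the error grows towards -1.
for z in (0.0, -4.0, -8.0):
    print(z, dJ_dz_sigmoid(1.0, z), dJ_dz_linear(1.0, z))
```

For the sigmoid output the derivative shrinks as the output saturates on the wrong side, although the error is then at its largest, whereas for the linear output it keeps growing with the error.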

We will now focus on an alternative loss function that bypasses the previous drawback. Let us adopt, 
for a classification task, as targets the 0, 1 values. Then the true and predicted values, $y_{nm}$, 
$\hat{y}_{nm}$, $n = 1, 2, \ldots, N$, $m = 1, 2, \ldots, k_L$, for $k_L$ output nodes, can be interpreted as 
probabilities and a commonly used cost function in this setting is the cross-entropy, which is defined as 

$$
J = -\sum_{n=1}^{N}\sum_{k=1}^{k_L} y_{nk} \ln \hat{y}_{nk}: \ \text{cross-entropy cost}. \tag{18.41}
$$

Note that this is exactly the same cost used in Eq. (7.49) for the multiclass logistic regression treated in 
Chapter 7. For the classification task, when the number of classes is equal to the number of output nodes, 
i.e., we have $k_L$ classes, then the cross-entropy is the negative log-likelihood of the observations. 
Indeed, for the 0, 1-class coding scheme for the observed label variables, $y_{nk}$, the corresponding 
target output vector, $\boldsymbol{y}_n \in \mathbb{R}^{k_L}$, will have a 1 at the position corresponding 
to the true class and the rest of the elements will be zero (one-hot vector). If we interpret 
$\hat{y}_{nk}$ as the respective posterior probability, i.e., $P(\omega_k | \boldsymbol{x}_n; \boldsymbol{\theta})$, 
where $\boldsymbol{\theta}$ denotes all the parameters that define the corresponding network, then the 
likelihood of the observations is given by 

$$
P(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_N | X; \boldsymbol{\theta}) = \prod_{n=1}^{N}\prod_{k=1}^{k_L} (\hat{y}_{nk})^{y_{nk}}. \tag{18.42}
$$

Taking the logarithm and changing the sign, Eq. (18.41) is obtained. 

A different formulation of the cross-entropy loss function, known as relative entropy, is sometimes 
used, i.e., 

$$
J = -\sum_{n=1}^{N}\sum_{k=1}^{k_L} y_{nk} \ln \frac{\hat{y}_{nk}}{y_{nk}}. \tag{18.43}
$$

This formulation brings up the affinity with the Kullback-Leibler (KL) divergence (Section 2.5.2) that 
measures deviation between probabilities (distributions). In our context, it measures how different 
$\hat{y}_{nk}$ is from $y_{nk}$. 

Besides the previously defined cross-entropy cost function, a variant of it is also used and it is given 
by 

$$
J = -\sum_{n=1}^{N}\sum_{k=1}^{k_L} \Big( y_{nk} \ln \hat{y}_{nk} + (1 - y_{nk}) \ln (1 - \hat{y}_{nk}) \Big), \tag{18.44}
$$

whose minimum occurs when $\hat{y}_{nk} = y_{nk}$, which for binary target values is equal to zero. 
Comparing the above cost in Eq. (18.44) with the cross-entropy in Eq. (18.41), it is easily seen that the 
former, while trying to “push” $\hat{y}_{nk}$ towards 1 for the correct class ($y_{nk} = 1$), at the same 
time tries to “push” the corresponding values for the wrong classes towards 0, in order to minimize $J$. 
Note that the cost in Eq. (18.44) can be seen as a generalization of the cross-entropy for the two-class 
case, using a single output neuron (see also Eq. (7.38)). Adopting the 0, 1-class coding scheme and 
assuming, this time, that the classes are not mutually exclusive and are independent, the counterpart of 
Eq. (18.42) becomes 

$$
P(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_N | X; \boldsymbol{\theta}) = \prod_{n=1}^{N}\prod_{k=1}^{k_L} (\hat{y}_{nk})^{y_{nk}} (1 - \hat{y}_{nk})^{1 - y_{nk}},
$$

which leads to Eq. (18.44). Note, however, that this interpretation does not hold true if classes are 
mutually exclusive, which is the case in most classification tasks. 

One can easily see that combining the above versions of the cross-entropy function with the logistic 
sigmoid nonlinearity “frees” the gradient of the cost, with respect to the synaptic weights of the $k$th 
output neuron, of the dependence on the derivative of the respective activation function. It can easily 
be shown that (Problem 18.7) 

$$
\frac{\partial J}{\partial \boldsymbol{\theta}_k^{L}} = \sum_{n=1}^{N} \delta_{nk}^{L}\, \boldsymbol{y}_n^{L-1},
$$

where $\delta_{nk}^{L} = y_{nk}(\hat{y}_{nk} - 1)$ for the cross-entropy loss in Eq. (18.41) and 
$\delta_{nk}^{L} = \hat{y}_{nk} - y_{nk}$ for the case of Eq. (18.44). Note that $\boldsymbol{y}_n^{L-1}$ is 
the vector of the outputs of the last hidden layer, $L - 1$. 

It can be shown (Problem 18.9) that the cost functions in (18.41) and (18.44) depend on the relative 
errors and not on the absolute errors, as is the case in the squared error loss; thus, small and 
large error values are equally weighted during the optimization. Furthermore, it can be shown that the 
cross-entropy belongs to the so-called well-formed loss functions, in the sense that if there is a solution 
that classifies correctly all the training data, the gradient descent scheme will find it [2]. In [215], it is 
pointed out that the cross-entropy loss function may lead to improved generalization and faster training 
for classification, compared to the squared error loss. 

Having said all that, recall that we have interpreted the outputs of the network as probabilities; yet, 
there is no guarantee that these add to one. This can be enforced by selecting the activation function in 
the last layer of nodes to be 

$$
\hat{y}_{nk} = \frac{\exp(z_{nk}^{L})}{\sum_{m=1}^{k_L} \exp(z_{nm}^{L})}: \ \text{softmax activation function}, \tag{18.45}
$$

which is known as the softmax activation function [21]. Recall that 
$z_{nk}^{L} = \boldsymbol{\theta}_k^{L\,T} \boldsymbol{y}_n^{L-1}$ is the value of the linear combiner 
(prior to the nonlinearity) associated with the $k$th output neuron (layer $r = L$, Fig. 18.11) that 
corresponds to the $n$th input training sample. For those familiar with logistic regression, discussed in 
Chapter 7, compare (18.45) with Eq. (7.47); after all, the world is small! 

As was the case with the sigmoid activation function, it is easy to show that combining the softmax 
activation with the cross-entropy loss function leads to gradients that are independent of derivatives of 
the nonlinearity (Problem 18.10). 
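
For a single training sample with a one-hot target, the derivative of the cross-entropy with respect to the linear outputs of a softmax layer reduces to the difference between predicted and target values. The following sketch computes this closed form; the function names are hypothetical, and subtracting the maximum before exponentiation is a standard numerical safeguard rather than part of Eq. (18.45).

```python
import math


def softmax(z):
    """Softmax of a list of linear outputs."""
    m = max(z)                       # shift by the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(y, z):
    """Cross-entropy between a one-hot target y and softmax(z)."""
    p = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p) if yk > 0)


def grad_wrt_z(y, z):
    """dJ/dz_k = y_hat_k - y_k; no derivative of the nonlinearity survives."""
    return [pk - yk for pk, yk in zip(softmax(z), y)]


z = [1.0, -0.5, 2.0]
y = [0.0, 1.0, 0.0]
print(softmax(z), cross_entropy(y, z), grad_wrt_z(y, z))
```

The closed form can be compared against a finite difference approximation of `cross_entropy` to confirm that no saturating factor multiplies the error.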


18.6 VANISHING AND EXPLODING GRADIENTS 

We have already discussed the importance of avoiding output nonlinearities that promote small gradient 
values when they are combined with some loss functions. Let us now look at what is happening in the 
hidden layers. The secret lies in Eq. (18.27) of the backpropagation algorithm. Let us investigate it a 
bit more. As we have seen in the previous subsection, at the heart of the computation of the gradients 
lie the quantities 

$$
\delta_j^{r} := \frac{\partial J}{\partial z_j^{r}},
$$

where the index $n$ has been suppressed to unclutter notation, $r$ is the index indicating the corresponding 
layer in an $L$-layer network, and $j$ is the index of a neuron at the $r$th layer that consists of $k_r$ 
neurons. By definition, $z_j^{r} := \boldsymbol{\theta}_j^{r\,T} \boldsymbol{y}^{r-1}$ is the linear output, 
prior to the nonlinearity $f$, of the $j$th neuron (associated with the parameter vector, 
$\boldsymbol{\theta}_j^{r}$) and $\boldsymbol{y}^{r-1}$ is the vector of the output values of the $(r-1)$th 
layer (Fig. 18.11). Then the propagation of the various $\delta$-derivatives follows the recursive rule 
(Eq. (18.27)), 

$$
\delta_j^{r-1} = \left(\sum_{k=1}^{k_r} \delta_k^{r}\,\theta_{kj}^{r}\right) f'(z_j^{r-1}), \quad j = 1, 2, \ldots, k_{r-1}.
$$

In words, the $\delta$-derivatives for layer $r - 1$ depend on the respective $\delta$-derivatives of the 
layer above, $r$, and (a) on the respective weights and (b) on the derivative of the nonlinearity, both in 
a multiplicative way. For the sake of clarity, let us write the above formula for two successive steps, 

$$
\delta_j^{r-1} = \left(\sum_{k=1}^{k_r} \left(\sum_{i=1}^{k_{r+1}} \delta_i^{r+1}\,\theta_{ik}^{r+1}\right) f'(z_k^{r})\,\theta_{kj}^{r}\right) f'(z_j^{r-1}), \quad j = 1, 2, \ldots, k_{r-1}.
$$

All one has to keep in mind from the above is not its exact formulation, but the notion that as the 
backpropagation algorithm flows backwards, the derivatives of the nonlinearities as well as the weights 
are multiplied and the number of the involved products grows. The lower (closer to the input layer) in 
hierarchy $\delta_j^{r}$ is, the more products are involved for its computation. There are two extremes that 
may, and often do, occur in practice. 

Taking into account that the derivatives of the activation function can be smaller than one (and in 
the case of the sigmoid-type nonlinearities can take very small values) if the estimated values of the 
parameters are not very large, the gradients of the cost, with respect to the parameters of the lower 
layers, can take vanishingly small values and this can make the training extremely slow (see, e.g., 
[95,225] for further discussion and insight). This phenomenon becomes more prominent when deep 
networks with many layers are involved. 

On the other extreme, if the values of the parameters happen to become very large, this can lead to 
an explosion of the gradient estimates. As a result, this can affect the learning by pushing the estimates 
of the parameters to the wrong region in the parameter space. 

Another related problem is that the gradients along the various layers may take values of different 
scale. This means that some layers can learn faster than others. This is basically a form of instability. 

All the above are some of the difficulties one encounters while training multilayer neural networks, 
especially when many layers are involved. 
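
The multiplicative build-up can be illustrated with a chain of layers that contain a single neuron each, for which the recursion of Eq. (18.27) reduces to a product of weights and activation derivatives. The setting, the function names, and the numerical values below are assumptions made for the example.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)      # at most 0.25, attained at z = 0


def relu_prime(z):
    return 1.0 if z > 0 else 0.0


def backprop_delta(delta_top, weights, zs, fprime):
    """Propagate delta from the top layer down a chain of one-neuron layers.

    weights[r] connects layer r to layer r + 1 and zs[r] is the linear
    output of layer r; returns the delta of every layer, top first.
    """
    deltas = [delta_top]
    for w, z in zip(reversed(weights), reversed(zs)):
        deltas.append(deltas[-1] * w * fprime(z))
    return deltas


depth = 10
vanish = backprop_delta(1.0, [1.0] * depth, [0.0] * depth, sigmoid_prime)
explode = backprop_delta(1.0, [8.0] * depth, [0.0] * depth, sigmoid_prime)
stable = backprop_delta(1.0, [1.0] * depth, [1.0] * depth, relu_prime)
print(vanish[-1], explode[-1], stable[-1])
```

With unit weights each sigmoid layer scales the derivative by at most 0.25, so it shrinks geometrically with depth, while weights equal to 8 make the per-layer factor equal to 2 and the derivative grows geometrically. A unit derivative, as in the last case, leaves the scale unchanged.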

To cope with such problems, a number of modifications and tricks have been developed. Playing 
with different cost functions is one, playing with different optimization variants of the backpropagation 
algorithm is another (see, e.g., [162] for a discussion). In the subsection to follow, we will look at 
nonlinearities for the hidden layers that are not of the saturating type; hence, they can “guard” us 
against the tendency towards small derivative values. 


18.6.1 THE RECTIFIED LINEAR UNIT 


Besides the activation functions that we have considered so far, the rectified linear unit (ReLU), defined 
as 

$$
f(z) = \max\{0, z\}: \ \text{ReLU}, \tag{18.46}
$$

has been proposed more recently (see Fig. 18.13). 



FIGURE 18.13 

The graph of ReLU. Note that for z > 0 the derivative is equal to one and for z < 0 it is equal to zero. 

It has been reported that in the context of deep networks, the use of the ReLU nonlinearity in 
the hidden layers can significantly speed up training [126]. Such an activation function does 
not suffer from saturation and its derivative is equal to one when the neuron operates in its active 
region ($z > 0$). Thus, it is desirable to set the biases of the neurons, during initialization, to some 
small positive value, e.g., $\theta_0 = 0.1$, in order to increase the probability that the input to the 
activation function is positive. For negative values, the derivative is zero. At $z = 0$ the derivative is 
not defined and if this happens during training, one can either choose 1 or 0; for those familiar with the 
notion of the subgradient (Chapter 8) such a choice is fully justified. 

Thus, selecting the ReLU as the activation function, one bypasses problems related to the slowing 
down when derivatives get small values. Recall that in the backpropagation algorithm, when gradients 
with respect to the parameters of hidden layers are computed, the derivative of the activation function 
enters in a multiplicative fashion. 

A drawback of the ReLU is that training freezes when $z < 0$, because the derivative is zero there. To
overcome this problem, variants of the ReLU have been proposed. For example, one can employ

$$f(z) = \max\{0, z\} + \alpha \min\{0, z\}.$$

Depending on the value of $\alpha$, different variants result. For $\alpha = -1$, the so-called absolute value
rectification is obtained [108]. If $\alpha$ is assigned a fixed small value, e.g., 0.01, the resulting nonlinearity is
known as the leaky ReLU [150]. In [77], $\alpha$ is left as a parameter to be learned during training. In a further
modification, known as the max output unit, a number of, say, $k$ different ReLUs are employed, whose
parameters are learned during the training, and, each time, the one that results in the maximum value
is selected to activate the corresponding neuron [60]. The ReLU was first introduced in the context of
dynamical networks in [71] and it was motivated by biological arguments.
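
The family $f(z) = \max\{0, z\} + \alpha \min\{0, z\}$ and its derivative can be sketched in a few lines. The function names below are hypothetical, NumPy is assumed, and the derivative at $z = 0$ is set to 1, which is only one of the admissible choices mentioned above.

```python
import numpy as np

def relu_family(z, alpha=0.0):
    """f(z) = max{0, z} + alpha * min{0, z}.

    alpha = 0    -> plain ReLU
    alpha = 0.01 -> leaky ReLU
    alpha = -1   -> absolute value rectification
    """
    z = np.asarray(z, dtype=float)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def relu_family_grad(z, alpha=0.0):
    """Derivative of relu_family: 1 for z > 0 and alpha for z < 0.

    At z = 0 the derivative is not defined; this sketch returns 1 there.
    """
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0.0, 1.0, alpha)
```

With $\alpha = 0$ the derivative vanishes for every negative input, which is the freezing effect described above; any nonzero $\alpha$ lets some gradient flow through for $z < 0$.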

The question that is now raised concerns the choice of the specific nonlinearity to be used in prac- 
tice. There is no definite answer to that and the choice depends on the specific application. At the time 
this edition is compiled, it seems that the use of ReLU, in any one of its versions, is the more popular 
choice for the hidden layers. For classification tasks and for the output layer, the softmax nonlinearity, 
combined with the cross-entropy cost function, is most commonly used. 


18.7 REGULARIZING THE NETWORK 

A crucial factor in training neural networks is to decide the size of the network. The size is directly 
related to the number of parameters to be estimated. Concerning feed-forward neural networks, two 
issues are involved. The first concerns the number of layers and the other the number of neurons per
layer. As we will discuss in Section 18.11, a number of reasons support the use of more than two hidden
layers. Such networks comprise a large number of parameters and can be vulnerable to overfitting,
which affects the generalization performance of the learner. One path to cope with overfitting is via 
regularization. Over the years, various regularization approaches have been proposed. A brief presen- 
tation and some guidelines are given below. We will return to overfitting and generalization issues in 
the context of deep networks and from a different perspective in Section 18.11.1. 

Weight decay: This path refers to a typical cost function regularization via the squared Euclidean norm
of the weights. Instead of minimizing a cost function, $J(\boldsymbol{\theta})$, its regularized version is used, such that
(e.g., [84])

$$J'(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|^2. \qquad (18.47)$$

We have already discussed in Chapter 3 in the context of ridge regression that involving the bias terms
in the regularizing norm is not a good practice, because it affects the translation-invariant property of
the estimator. A more sensible way to regularize is to remove the bias terms from the norm. Although
this simple type of regularization helps in improving the generalization performance of the network,
and it can be sufficient for some cases, in general it is not the most appropriate way to go.

Besides the squared Euclidean norm, other norms can be used in Eq. (18.47). For example, an
alternative would be to employ the $\ell_1$ norm. From Chapter 9, we know that the latter norm promotes
sparsity. This is a welcome property when training a large network so that one can push less informative
parameters to zero.

Over the years, various scenarios in applying the regularized cost function have been proposed. One
scenario is to group all the parameters (excluding biases) for each layer together and apply Eq. (18.47),
involving different regularizing constants for each group. Another scenario is to deal with the
parameters of each neuron separately. Another path is to look at regularization as being equivalent to
constraining the norm to be less than a preselected value (Chapter 3). This turns out to be equivalent
to a projection step onto the respective ball associated with the norm, e.g., the $\ell_2$ ball (see Chapter 8),
which leads to a normalization operation (see [91]).
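
As a rough illustration of the two views above, the following sketch shows one gradient step on the regularized cost of Eq. (18.47) with the bias left out of the penalty, together with the projection onto an $\ell_2$ ball. The function names, the step size, and the way the gradients are passed in are assumptions made for this example only.

```python
import numpy as np

def weight_decay_step(W, b, grad_W, grad_b, step=0.01, lam=1e-3):
    """One gradient descent step on J(theta) + lam * ||W||^2.

    grad_W and grad_b are the gradients of the unregularized cost J.
    The bias b is deliberately excluded from the penalty.
    """
    W_new = W - step * (grad_W + 2.0 * lam * W)  # gradient of lam*||W||^2 is 2*lam*W
    b_new = b - step * grad_b
    return W_new, b_new

def project_l2_ball(w, radius):
    """Project w onto {v : ||v||_2 <= radius}.

    Vectors inside the ball are left untouched; vectors outside are
    rescaled to have norm equal to radius (a normalization operation).
    """
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)
```

In the constrained view, `project_l2_ball` would be applied to the weights of a layer or of a single neuron after each update, depending on which grouping scenario is adopted.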

Weight elimination: Instead of employing the norm of the weights, another approach involves more
general functions for the regularization term, that is,

$$J'(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \lambda h(\boldsymbol{\theta}). \qquad (18.48)$$


For example, in [246] the following is used:

$$h(\boldsymbol{\theta}) = \sum_{k=1}^{K} \frac{\theta_k^2}{\theta_h^2 + \theta_k^2}, \qquad (18.49)$$

where $K$ is the total number of the parameters involved and $\theta_h$ is a preselected threshold value. A careful
look at this function reveals that if $|\theta_k| \ll \theta_h$, the penalty term goes to zero very fast. In contrast, for
values $|\theta_k| \gg \theta_h$, the penalty term tends to unity. In this way, less significant weights are pushed toward
zero. A number of variants of this method have also appeared (see, for example, [201]).
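
The behavior of such a penalty is easy to check numerically. The sketch below assumes the per-weight term $\theta_k^2 / (\theta_h^2 + \theta_k^2)$, which matches the limiting behavior described above; the function name is hypothetical.

```python
import numpy as np

def weight_elimination_penalty(theta, theta_h):
    """Sum over k of theta_k^2 / (theta_h^2 + theta_k^2).

    Each term is close to 0 for |theta_k| << theta_h
    and close to 1 for |theta_k| >> theta_h.
    """
    theta = np.asarray(theta, dtype=float)
    return float(np.sum(theta**2 / (theta_h**2 + theta**2)))
```

Because each term saturates at one, the penalty roughly counts the weights that are large relative to the threshold instead of growing with their magnitude, as the squared Euclidean norm does.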

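Methods based on sensitivity analysis: In [130], the so-called optimal brain damage technique is
proposed. A perturbation analysis of the cost function in terms of the weights is performed, via the
second-order Taylor expansion, that is,

$$\delta J = \sum_{i=1}^{K} g_i \delta\theta_i + \frac{1}{2} \sum_{i=1}^{K} h_{ii} \delta\theta_i^2 + \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} h_{ij} \delta\theta_i \delta\theta_j, \qquad (18.50)$$

where

$$g_i = \frac{\partial J}{\partial \theta_i}, \quad h_{ij} = \frac{\partial^2 J}{\partial \theta_i \partial \theta_j}.$$

Then, assuming the Hessian matrix to be diagonal and if the algorithm operates near the optimum (zero
gradient), we can approximately set

$$\delta J \approx \frac{1}{2} \sum_{i=1}^{K} h_{ii} \delta\theta_i^2. \qquad (18.51)$$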

The method works as follows: 

• The network is trained using the backpropagation algorithm. After a few iteration steps, the training 
is frozen. 

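• The so-called saliencies, defined as

$$s_i = \frac{h_{ii} \theta_i^2}{2},$$

where $h_{ii}$ is the $i$th diagonal element of the Hessian matrix of the cost function, are computed for each
weight, and weights with a small saliency are removed. Basically, the saliency
measures the effect on the cost function if one removes (sets equal to zero) the respective weight.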

• Training is continued and the process is repeated every few iterations, until a stopping criterion is 
satisfied. 

In [74], the full Hessian matrix is computed, giving rise to the optimal brain surgeon method. Note that 
regularization techniques that remove connections are also known as pruning techniques. 
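
The pruning step of the procedure above can be sketched as follows, assuming that the diagonal Hessian entries are already available (how they are computed is outside this sketch) and that a fixed saliency threshold is used; both the function name and the thresholding rule are illustrative choices.

```python
import numpy as np

def prune_by_saliency(theta, hess_diag, threshold):
    """Zero out the weights whose saliency s_i = h_ii * theta_i^2 / 2
    falls below the given threshold.

    Returns the pruned weights, the saliencies, and a boolean mask that
    marks the surviving weights.
    """
    theta = np.asarray(theta, dtype=float)
    saliency = 0.5 * np.asarray(hess_diag, dtype=float) * theta**2
    keep = saliency >= threshold
    return np.where(keep, theta, 0.0), saliency, keep
```

In a training loop, this function would be called every few iterations, and the returned mask would be used to keep the removed weights at zero during the subsequent updates.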

Early stopping: An alternative technique to avoid overfitting, primitive in concept yet particularly
useful in practice, is the so-called early stopping. The idea is to stop the training when the error on
data that are not used for the parameter updates starts increasing. Training the network over many epochs
can lead the training error to converge to small values. However, this can be an indication of overfitting
rather than of a good solution.
According to the early stopping method, training is performed for some iterations and then it is frozen.
The network, using the currently available estimates of the weights/biases, is evaluated against a
validation/test data set and the value of the cost function is computed. Then training is resumed, and after
some iterations the previous process is repeated. When the value of the cost function, computed on this
held-out set, starts increasing, then training is stopped.

Early stopping is also used in combination with other regularization strategies. Even when using 
regularization strategies that modify the objective function, the decision when to stop training cannot 
be based solely on focusing on the training error. 
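
The early stopping loop described above can be sketched as follows. The callables `step_fn` (one training update) and `val_loss_fn` (cost on the held-out set) are hypothetical placeholders for whatever optimizer and network are used, and the `patience` rule (stop after a number of consecutive checks without improvement) is a common practical variant, not something prescribed in the text.

```python
import numpy as np

def train_with_early_stopping(step_fn, val_loss_fn, theta0,
                              check_every=10, patience=3, max_iters=10_000):
    """Run step_fn repeatedly; every check_every iterations evaluate the
    held-out cost and stop once it has failed to improve for `patience`
    consecutive checks. Returns the best parameters seen, their held-out
    cost, and the iteration at which training stopped.
    """
    theta = np.array(theta0, dtype=float)
    best_theta, best_loss = theta.copy(), val_loss_fn(theta)
    bad_checks = 0
    for it in range(1, max_iters + 1):
        theta = step_fn(theta)                 # training update
        if it % check_every == 0:              # training is "frozen" here
            loss = val_loss_fn(theta)
            if loss < best_loss:
                best_theta, best_loss, bad_checks = theta.copy(), loss, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break
    return best_theta, best_loss, it
```

Returning the best parameters seen so far, and not the last ones, avoids keeping an estimate obtained after the held-out cost has already started to grow.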

Regularization via noise injection: The regularization effect of the presence of noise during training 
has been well known since the early 1990s. There are various ways of adding extra noise to the input 
samples (e.g., [19]). This is equivalent to modifying the cost function by adding an extra term which 
acts as a regularizer. 
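
To see why input noise acts as a regularizer, it may help to work out the simplest case. The following is only an illustration for a linear model $\hat{y} = \boldsymbol{\theta}^T \boldsymbol{x}$ with the squared error loss, assuming additive noise $\boldsymbol{\eta}$ that is zero mean, independent of the data, and has covariance $\sigma^2 I$ (these assumptions are ours and are not taken from [19]):

$$\mathbb{E}_{\boldsymbol{\eta}}\left[\left(y - \boldsymbol{\theta}^T(\boldsymbol{x} + \boldsymbol{\eta})\right)^2\right] = \left(y - \boldsymbol{\theta}^T\boldsymbol{x}\right)^2 - 2\left(y - \boldsymbol{\theta}^T\boldsymbol{x}\right)\boldsymbol{\theta}^T\mathbb{E}[\boldsymbol{\eta}] + \boldsymbol{\theta}^T\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^T]\boldsymbol{\theta} = \left(y - \boldsymbol{\theta}^T\boldsymbol{x}\right)^2 + \sigma^2\|\boldsymbol{\theta}\|^2.$$

In this special case, averaging the loss over the noise reproduces the squared Euclidean norm penalty of Eq. (18.47), with the noise variance playing the role of the regularization constant.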

The other alternative is to add noise to the parameters as they are being estimated during training
(see, e.g., [166]). Looking at the noise as a small perturbation on the parameters and employing a
first-order Taylor expansion of the output, one can show that this is equivalent to adding an extra
regularizing term to the cost function. Large gradient values are penalized and this pushes the algorithm
to converge to solutions in the parameter space that are relatively insensitive to small variations.

Besides the two previous possibilities, there are strategies that attempt to cope with noisy output
labels. This is known as label smoothing (see, e.g., [228]).
Regularizing via the artificial expansion of the data set: As has already been discussed in various parts
of the book (e.g., Chapter 3), the cause of overfitting is the large number (capacity) of
unknown parameters of the model, with respect to the size of the training set. Keeping the model fixed,
while increasing the number of the training data, acts beneficially on the overfitting problem. However,
obtaining more labeled data may not always be practical. In such cases and in certain applications, a way out
is to generate artificially “fake” data and use them as part of the training set. For example, in tasks
such as object recognition and OCR, one can generate many replicas of the objects or digits and letters,
by applying transformations, such as rotations, translations, and scaling, on existing images. In
[110], data augmentation has also been applied to speech recognition. Later on, in Section 18.15, we
are going to discuss more recent techniques on how to artificially generate data.
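
As a toy illustration of this kind of augmentation, the sketch below generates shifted replicas of a single image. It uses cyclic shifts via `np.roll`, so pixels that leave one border re-enter from the opposite one; this is a simplification adopted here for brevity, and a practical pipeline would more likely pad or crop. The function name is hypothetical.

```python
import numpy as np

def augment_with_shifts(image, max_shift=1):
    """Return all cyclic translations of a 2-D image by up to max_shift
    pixels in each direction, including the unshifted original.
    The replicas share the label of the original image.
    """
    replicas = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            replicas.append(np.roll(image, shift=(dy, dx), axis=(0, 1)))
    return np.stack(replicas)
```

For `max_shift=1`, each training image yields nine replicas, so the training set grows ninefold without any additional labeling effort.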

Example 18.2. The goal of this example is to show the effect that regularization has on overfitting.
Fig. 18.14 shows the resulting decision lines that separate the samples of the two classes, denoted by
black and red “o,” respectively. Fig. 18.14A corresponds to a multilayer perceptron with two hidden
layers and 20 neurons in each of them, amounting to a total of 480 weights. Training was performed
via the batch gradient descent backpropagation algorithm. The overfitting nature of the resulting curve
is readily observed. Fig. 18.14B corresponds to the same multilayer perceptron; however, this time a
pruning algorithm was employed. Specifically, the method based on parameter sensitivity was used,
testing the saliency values of the weights every 100 epochs and removing weights with saliency value
below a chosen threshold. Finally, only 25 of the 480 weights survived and the curve is simplified to a
straight line.

FIGURE 18.14

Decision curve (A) before pruning and (B) after pruning.

DROPOUT

This technique for regularizing deep networks follows a different concept from what we have already
discussed. Its origin borrows ideas from the concept of combining learners, as has already been
discussed in Sections 7.8 and 7.9, and in particular from the bagging rationale. However, dropout cleverly
modifies the basic idea of bagging to make it more efficient and suitable to large networks.

The term “dropout” refers to dropping out units/nodes (in the hidden and input layers) in a neural
network. At each iteration of the training algorithm, a number of nodes are removed (along with their
incoming and outgoing associated connections). The parameters of the remaining nodes are updated
according to the updating rule. In other words, at each iteration step only a subset of the parameters
are updated while the rest (the ones associated with the removed nodes) are frozen to their currently
available estimates from the previous iteration. The subset consisting of the remaining nodes defines
a subnetwork of the original larger one. The procedure is illustrated in Fig. 18.15. In Fig. 18.15A, the
full network is shown. In Fig. 18.15B, the nodes to be removed are shown in red, together with all
their incoming and outgoing connections. In Fig. 18.15C, the subnetwork to be updated in the current
iteration involves only the gray nodes. The red ones have been removed and the estimates of their
associated parameters are frozen to the values that have already been obtained in the previous iteration
and which are not updated in the current step. In the figure, one of the removed nodes is an input one
and the rest lie in hidden layers.

FIGURE 18.15

(A) The full network. (B) The nodes to be removed are shown in red, together with all incoming and outgoing
connections. (C) The red ones have been removed. The estimates of their associated parameters are frozen to the values
obtained in the previous iteration and are not updated in the current iteration step.

The removal of nodes is carried out probabilistically. That is, each node is retained with
probability $P$. Usually, the value of $P$ is equal to 0.5 for the hidden layers and is set equal to 0.8 for the input
layer nodes. For a network with, say, $K$ nodes, the total number of possible subnetworks is $2^K$. This
is indeed a large number for large values of $K$, which is the case in practice. However, keep in mind
that there is a high degree of parameter sharing among the various subnetworks and the total number
of parameters to be learned is equal to the number of parameters that constitute the original network.

Once the training phase has been completed and the backpropagation algorithm has converged, the
obtained estimates of the weights are multiplied by the respective retention probability, $P$. This
corresponds to an averaging
operation over all possible subnetworks that have been trained [91]. It can be shown that (e.g., [64,91])
for a network with one hidden layer and $K$ nodes with a softmax output unit for computing the
probabilities of the class labels, using the mean network is equivalent to taking the geometric mean of the
probability distributions over labels predicted by all $2^K$ possible networks. In [245], it is shown that
applying the dropout rationale to a linear regression task is equivalent to an $\ell_2$ regularization with
different regularizing weights per model parameter. However, this equivalence does not carry over to general
deep networks.
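
The two phases of dropout can be sketched for a single layer as follows. The function names are hypothetical, and the sketch follows the scheme described above (random masking during training, scaling of the weights by $P$ afterwards); many software packages instead rescale the retained activations by $1/P$ during training, which is a different but related convention.

```python
import numpy as np

def dropout_mask_outputs(a, keep_prob, rng):
    """Training phase: retain each node output in `a` independently with
    probability keep_prob and set the outputs of removed nodes to zero.
    Returns the masked outputs and the boolean mask; the mask also tells
    which parameters are left untouched in the current update.
    """
    mask = rng.random(a.shape) < keep_prob
    return a * mask, mask

def scale_weights_for_testing(W_out, keep_prob):
    """After training: multiply the outgoing weights of the nodes by their
    retention probability, so the full network approximates the average
    of the trained subnetworks.
    """
    return keep_prob * W_out
```

For example, with `keep_prob=0.5` roughly half of the node outputs are zeroed at each iteration, and a fresh mask is drawn at every training step.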

A heuristic explanation on why dropout works is provided in [91]. Dropout reduces coadaptation
of neurons, because in every iteration the parameters of different sets of neurons are updated. Thus the
network is “forced” to learn more robust features. In other words, the network learns while, each time,
parts of it are missing. A more theoretically pleasing explanation is given in [255], where it is shown
that the dropout technique is equivalent to an approximate variational inference with specific priors in
a deep Gaussian process.

At the time the current edition is compiled, the dropout technique seems to be the most popular
method for regularization and has been applied to a wide range of applications (see, e.g., [210]). Also,
using dropout does not prevent combining it with other types of regularization [91]. Needless to say,
the method is not a “panacea” and has its drawbacks. A major shortcoming is that typically one has
to train much larger networks to account for the loss of capacity imposed by the regularization.
Moreover, one needs more iterations for the training. A dropout network typically takes 2-3 times longer
to train than a standard neural network of the same architecture. To this end, besides the previously
reported scheme, computationally efficient alternatives have also been proposed (see, e.g., [241,243]).


FIGURE 18.16

The use of dropout during training has a significant effect on the generalization performance of the network. This is
verified by the test error (vertical axis of the plot), measured by the number of errors committed in the test set. The
probabilities for retaining nodes were P = 0.5 for the hidden layers and P = 0.8 for the input nodes.


Example 18.3. Fig. 18.16 illustrates the effect of the dropout regularization on the test error, in the
context of an optical character recognition (OCR) classification task. A fully connected feed-forward
network is used with 784 nodes in the input layer, 2000 nodes in each one of the two hidden layers,
and 10 output nodes, one per class. The inputs to the network come from the MNIST database for
digit recognition. The input images are of size $28 \times 28$ (784 pixels) and the pixel values were normalized in the
range [0, 1]. For the training, 55000 images were used and 10000 were kept for the testing phase. The
nonlinearity used in the hidden layers was the ReLU and for the output nodes the softmax one. The loss
function was the cross-entropy one. The standard gradient descent algorithm was used with step size
$\mu = 0.01$. The network was trained for 600 epochs and the minibatch size was equal to 100. To obtain
the curves, the error over the test set is computed using the estimates obtained during the training phase,
every time an epoch has been completed. Observe that the use of dropout has a significant
effect on the obtained test error. Also, a combination of dropping out hidden as well as input nodes
has a beneficial effect. One should notice that although the use of dropout is beneficial, even without
regularization, the network is doing quite well. We will come back to that in Section 18.11.1.
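
To get a feeling for the size of this experiment, the short computation below counts the parameters of the 784-2000-2000-10 network and the number of minibatch updates. It assumes one bias per hidden and output node, which the example does not state explicitly.

```python
layer_sizes = [784, 2000, 2000, 10]

# Weights plus one bias per node for every pair of consecutive layers.
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_train, batch_size, n_epochs = 55000, 100, 600
updates_per_epoch = n_train // batch_size
total_updates = n_epochs * updates_per_epoch

print(n_params, updates_per_epoch, total_updates)
```

Under these assumptions the network has about 5.6 million parameters, two orders of magnitude more than the 55000 training images, which puts the good performance reported even without regularization into perspective.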


18.8 DESIGNING DEEP NEURAL NETWORKS: A SUMMARY 

So far in our discussion, we have touched upon a number of challenges that one has to address while
building up a deep neural network for a specific application. No doubt, a practitioner has to make a
number of critical decisions prior to running the code, using, usually, one of the off-the-shelf software
packages.

Adopting the neural network architecture and selecting the related hyperparameters is still a task
with a strong engineering flavor. It may not be extreme to compare this task to the task of building
up a complex circuit in the times before the development of related sophisticated software packages.
The engineer had to employ a number of tricks, which were learned from experience rather than from
theory. The goal of this section is to summarize and put together the major challenges and some general
tips that have to be followed.

• Think of the problem at hand carefully, try to understand its specificities and the nature and the
statistical properties of the data and the goals, prior to building up the network. Avoid
playing with algorithms before making sure that the problem at hand has been understood.

• Look at the data and make sure that they are well collected and highly representative of the problem. 
Make sure that there are no “biases” that favor certain decisions or classes. Biases of special concern, 
which are also the most difficult ones to identify, are those that take place at a subconscious level 
and are driven by social stereotypes, e.g., issues related to gender, race, religion, and social class. 

• Make sure to use the appropriate type of nonlinearities for the hidden as well as the output units.
The right choice is also problem dependent, although we have already made some suggestions
concerning the “typical” case. Yet, “typical” does not necessarily mean “must.”

• Make sure that the right loss function is selected for the optimization task. The related issues have already
been discussed before. Yet, one may need to be more imaginative and use other loss functions,
which are more appropriate for the problem at hand. One should not necessarily be biased toward the
few examples that are given in the book. Over the many decades of machine learning history, a
number of alternatives have been suggested and used in the various scientific communities, where
application-specific loss functions may be more appropriate.

• Selecting the size of the network is the first challenge, that is, to decide the number of layers and
the number of nodes per layer. If the network is too big, even the use of regularization may not be
enough to cope with the overfitting issue. If it is too small, the performance may be poor. In some
cases, one may have to reconsider and obtain more data, if possible. Before one starts developing
the architecture, searching what others have done in similar situations and settings is advisable.
When searching for the “best” size, it is a common practice to split the available data set into three parts,
namely, the training set, the validation set, and the test set.
The validation set is used during training, for evaluating the comparative performance of different
models, which are trained on the training set. This is a “hybrid” set that is used as an independent set
to assist the designer, during the training phase, to select the model. Evaluating the final performance
is entirely done via the test set, which has not participated in any way during training. The validation
set should not be confused with the cross-validation method discussed in Chapter 3. Such techniques
cannot be employed when large data sets are involved, due to computational time constraints.
Moreover, when large data sets are available, one can afford the split of the data set into parts to
serve different purposes.

• For each algorithm that will be used, one has to carefully select the hyperparameters of the employed 
stochastic gradient optimizer. For example, one has to follow the evolution of the algorithm’s con- 
vergence and may have to reconsider the choice that has been made. The choice of the minibatch 
size should also be carefully made, to serve the needs of the optimizing algorithm and at the same 
time to exploit the computational aspects of the specific architecture that is used to run the algo¬ 
rithm. 

• Initialize carefully the parameters and normalize the values obtained by the network appropriately, 
e.g., by batch normalization. Guidelines have already been given before. Yet, guidelines are to help 
and one may have to be more careful and imaginative. 

• Employ regularization. To this end, one has to experiment with the probabilities of removing nodes 
for the dropout or the involved regularizing parameters, if another method is involved. 

• Modern neural network architectures comprise millions of parameters; yet, their entailed compu- 
tations are highly parallelizable. GPUs are relatively cheap and ubiquitous hardware devices that 
can afford massive parallelization of the computation, as they comprise thousands of cores. Thus, 
acquiring multiple GPUs is a most needed investment. As the memory capacity of a GPU increases 
(or multiple GPUs are used), one can increase the minibatch size, which leads to faster training (due 
to parallelization) and often better convergence (due to large batches). 

• Exploit the experience gained by others in similar setups, when implementing and running an
algorithm. Furthermore, keep in mind that the field is still being developed and new results and
techniques may appear any time. Thus, follow closely the advances as they happen.

• Never run the algorithm in a black-box rationale. Try to understand what each one of the
hyperparameters means and how it can affect the performance.

A final tip to the practitioner that I keep repeating to my students is: understand before you run. 
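
The three-way split mentioned in the tips above can be sketched as follows. The function name, the default fractions, and the use of a random permutation are assumptions of this example; in practice, the split may also have to respect class proportions or temporal order.

```python
import numpy as np

def split_indices(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Randomly partition range(n_samples) into disjoint index sets for
    training, validation (model selection), and testing (final evaluation).
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = perm[:n_test]
    val_idx = perm[n_test:n_test + n_val]
    train_idx = perm[n_test + n_val:]
    return train_idx, val_idx, test_idx
```

The test indices should be set aside immediately and used only once, after all design decisions, including those guided by the validation set, have been made.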


18.9 UNIVERSAL APPROXIMATION PROPERTY OF FEED-FORWARD 
NEURAL NETWORKS 

In Section 18.3, the classification power of a three-layer feed-forward neural network, built around 
the McCulloch-Pitts neuron, was discussed. Then, we moved on to employ smooth versions of the 
activation function, for the sake of differentiability. The issue now is whether we can say something 
more concerning the prediction power of such networks. It turns out that some strong theoretical results 
have been produced, which provide support for the use of neural networks in practice (see, for example, 
[36,54,96,105]). 
Let us consider a two-layer network, with one hidden layer, involving sigmoidal nonlinearities and
with a single output linear node. The output of the network is then written as

$$\hat{g}(\boldsymbol{x}) = \sum_{k=1}^{K} \theta_k^o f(\boldsymbol{\theta}_k^{hT} \boldsymbol{x}) + \theta_0^o, \qquad (18.52)$$

where the vector $\boldsymbol{\theta}_k^h$ consists of the synaptic weights and the bias term that define the $k$th hidden neuron
and the superscript “o” refers to the output neuron. Then the following theorem holds true.

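Theorem 18.1. Let $g(\boldsymbol{x})$ be a continuous function defined in a compact$^3$ subset $S \subset \mathbb{R}^l$. Then, for any
$\epsilon > 0$, there is a two-layer network with $K(\epsilon)$ hidden nodes of the form in Eq. (18.52), so that

$$|g(\boldsymbol{x}) - \hat{g}(\boldsymbol{x})| < \epsilon, \quad \forall \boldsymbol{x} \in S. \qquad (18.53)$$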

In [12], it is shown that the approximation error decreases according to an $O(1/K)$ rule. In other
words, the input dimensionality does not enter into the scene and the error depends on the number
of neurons used. The theorem states that a two-layer neural network is sufficient to approximate any
continuous function; that is, it can be used to realize any nonlinear discriminant surface in a
classification task or any nonlinear function for prediction in a general regression problem. This is a strong
theorem indeed. Related universal approximation theorems have been proved for a more general class
of activation functions, including the ReLU (see, e.g., [138,218]).

However, what the theorem does not say is how big such a network can be in terms of the required 
number of neurons in the single layer. It may be that a very large number of neurons are needed to 
obtain a good enough approximation. This is where the use of more layers can be advantageous. Using 
more layers, the overall number of neurons needed to achieve certain approximation may be much 
smaller. We will come to this issue soon, when discussing deep architectures. 

Remarks 18.5. 

• Extreme learning machines (ELMs): These are single-layered feed-forward networks (SLFNs) with
output of the form [97]

$$g_K(\boldsymbol{x}) = \sum_{i=1}^{K} \theta_i^o f(\boldsymbol{\theta}_i^{hT} \boldsymbol{x} + b_i), \qquad (18.54)$$

where $f$ is the respective activation function and $K$ is the number of hidden nodes. The main
difference with standard single-layer feed-forward networks is that the parameters associated with
each node (i.e., $\boldsymbol{\theta}_i^h$ and $b_i$) are generated randomly, whereas the weights of the output function (i.e.,
$\theta_i^o$) are selected so that the squared error over the training points is minimized. This implies solving

$$\min_{\boldsymbol{\theta}^o} \sum_{n=1}^{N} \left(y_n - g_K(\boldsymbol{x}_n)\right)^2. \qquad (18.55)$$

Hence, according to the ELM rationale, we do not need to compute the values of the parameters
for the hidden layer. It turns out that such a training philosophy has a solid theoretical foundation,
as convergence to a unique solution is guaranteed. It is interesting to note that although the node
parameters are randomly generated, for infinitely differentiable activation functions, the training
error can become arbitrarily small if $K$ approaches $N$ (it becomes zero if $K = N$). Furthermore,
the universal approximation theorem ensures that for sufficiently large values of $K$ and $N$, $g_K$
can approximate any nonconstant piecewise continuous function [98]. A number of variations and
generalizations of this simple idea can be found in the respective literature. The interested reader is
referred to, for example, [99,186] for related reviews.

$^3$ Closed and bounded.
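
The ELM rationale translates almost directly into code. The sketch below draws the hidden-layer parameters from a standard Gaussian, uses `tanh` as the activation function, and solves the least-squares task for the output weights with `np.linalg.lstsq`; these are illustrative choices, as are the function names, and they are not prescribed by [97].

```python
import numpy as np

def elm_fit(X, y, K, rng):
    """Fit an ELM with K hidden nodes.

    The hidden weights W and biases b are generated randomly and never
    trained; only the output weights theta_o are estimated, by minimizing
    the squared error over the training points.
    """
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, K))
    b = rng.standard_normal(K)
    H = np.tanh(X @ W + b)                        # N x K hidden-layer outputs
    theta_o, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, theta_o

def elm_predict(X, W, b, theta_o):
    """Evaluate g_K(x) for every row of X."""
    return np.tanh(X @ W + b) @ theta_o
```

A possible usage, on a toy one-dimensional regression task:

```python
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(X[:, 0])
W, b, theta_o = elm_fit(X, y, K=50, rng=rng)
train_mse = np.mean((elm_predict(X, W, b, theta_o) - y) ** 2)
```

Since the hidden layer is fixed, training reduces to a linear least-squares problem, which is the reason no iterative backpropagation is needed.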


18.10 NEURAL NETWORKS: A BAYESIAN FLAVOR 

In Chapter 12, the (generalized) linear regression and the classification tasks were treated in the
framework of Bayesian learning. Because a feed-forward neural network realizes a parametric input-output
mapping, $f_{\boldsymbol{\theta}}(\boldsymbol{x})$, there is nothing to prevent us from looking at the problem from a fully statistical point
of view. Let us focus on the regression task and assume that the noise variable is a zero mean Gaussian
one. Then the output variable, given the value of $f_{\boldsymbol{\theta}}(\boldsymbol{x})$, is described in terms of a Gaussian distribution,

$$p(y|\boldsymbol{\theta}; \beta) = \mathcal{N}(y | f_{\boldsymbol{\theta}}(\boldsymbol{x}), \beta^{-1}), \qquad (18.56)$$

where $\beta$ is the noise precision variable. Assuming successive training samples, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, \ldots, N$,
to be independent, we can write

$$p(\boldsymbol{y}|\boldsymbol{\theta}; \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n | f_{\boldsymbol{\theta}}(\boldsymbol{x}_n), \beta^{-1}). \qquad (18.57)$$

Adopting a Gaussian prior for $\boldsymbol{\theta}$, that is,

$$p(\boldsymbol{\theta}; \alpha) = \mathcal{N}(\boldsymbol{\theta} | \boldsymbol{0}, \alpha^{-1} I), \qquad (18.58)$$

the posterior distribution, given the output values $\boldsymbol{y}$, can be written as

$$p(\boldsymbol{\theta}|\boldsymbol{y}) \propto p(\boldsymbol{\theta}; \alpha) p(\boldsymbol{y}|\boldsymbol{\theta}; \beta). \qquad (18.59)$$

However, in contrast to Eq. (12.16), the posterior is not a Gaussian one, owing to the nonlinearity of the
dependence on $\boldsymbol{\theta}$. Here is where complications arise and one has to employ a series of approximations
to deal with it.

Laplacian approximation: The Laplacian approximation method, introduced in Chapter 12, is
adopted to approximate $p(\boldsymbol{\theta}|\boldsymbol{y})$ by a Gaussian one. To this end, the maximum, $\boldsymbol{\theta}_{\text{MAP}}$, has to be
computed, which is carried out via an iterative optimization scheme. Once this is found, the posterior can
be replaced by a Gaussian approximation, denoted as $q(\boldsymbol{\theta}|\boldsymbol{y})$.
Taylor expansion of the neural network mapping: The final goal is to compute the predictive
distribution,

$$p(y|\boldsymbol{x}, \boldsymbol{y}) = \int p(y|f_{\boldsymbol{\theta}}(\boldsymbol{x})) q(\boldsymbol{\theta}|\boldsymbol{y}) \, d\boldsymbol{\theta}. \qquad (18.60)$$

However, although the involved PDFs are Gaussians, the integration is intractable, because of the
nonlinear nature of $f_{\boldsymbol{\theta}}$. In order to carry this out, a first-order Taylor expansion is performed,

$$f_{\boldsymbol{\theta}}(\boldsymbol{x}) \approx f_{\boldsymbol{\theta}_{\text{MAP}}}(\boldsymbol{x}) + \boldsymbol{g}^T(\boldsymbol{\theta} - \boldsymbol{\theta}_{\text{MAP}}), \qquad (18.61)$$

where $\boldsymbol{g}$ is the respective gradient computed at $\boldsymbol{\theta}_{\text{MAP}}$, which can be computed using backpropagation
arguments. After this linearization, the mean of the Gaussian in the integrand becomes linear with respect to
$\boldsymbol{\theta}$ and the integration leads to an approximate Gaussian predictive distribution as in Eq. (12.21). For the classification,
instead of the Gaussian PDF, the logistic regression model as in Section 13.7.1 of Chapter 13 is adopted
and similar approximations as before are employed. More on this more classical view on the Bayesian
approach to neural networks can be found in [151,152].
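
To make the last step more concrete, the following sketch works out the result of the integration under the linearization. It assumes that the Gaussian approximation of the posterior is written as $q(\boldsymbol{\theta}|\boldsymbol{y}) = \mathcal{N}(\boldsymbol{\theta} | \boldsymbol{\theta}_{\text{MAP}}, \Sigma)$, where the symbol $\Sigma$ for its covariance matrix is introduced here for illustration only. Under the linearized model, $y = f_{\boldsymbol{\theta}_{\text{MAP}}}(\boldsymbol{x}) + \boldsymbol{g}^T(\boldsymbol{\theta} - \boldsymbol{\theta}_{\text{MAP}}) + \eta$, with the noise $\eta$ being zero mean Gaussian of variance $\beta^{-1}$ and independent of $\boldsymbol{\theta}$. Hence, $y$ is a linear combination of independent Gaussian variables, and

$$p(y|\boldsymbol{x}, \boldsymbol{y}) \approx \mathcal{N}\left(y \,\middle|\, f_{\boldsymbol{\theta}_{\text{MAP}}}(\boldsymbol{x}), \; \beta^{-1} + \boldsymbol{g}^T \Sigma \boldsymbol{g}\right).$$

The predictive variance thus comprises two terms: the noise variance and a term that reflects the remaining uncertainty about the parameters, propagated through the gradient of the network mapping at the input $\boldsymbol{x}$.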

Bayesian inference methods: More recently, an interest in the Bayesian view of deep networks
has been revived in an effort to enforce regularization and pruning in a more efficient way. In this
vein, all parameters are treated as random variables that are described in terms of conditional and prior
distributions. The latter act as regularizers. In the sequel, variational Bayesian techniques are mobilized
to infer the posteriors. For example, in [142], hierarchical priors are introduced to prune nodes instead
of individual weights. Also, the posterior uncertainties are exploited to determine optimal fixed point
precision to encode the weights.

In [174], nonparametric Bayesian arguments via the Indian buffet process (IBP) (Chapter 13) are
mobilized to prune weights, nodes, or whole kernels in the case of convolutional networks (see
Section 18.12). Also, the nonlinearities are substituted by fully probabilistic units involving competing
local winner-take-all (LWTA) arguments. It is reported that the approach leads to very efficient
structures in terms of number of units as well as in terms of bit precision requirements.


18.11 SHALLOW VERSUS DEEP ARCHITECTURES 


In our tour so far in this chapter, we have discussed various aspects of learning feed-forward networks 
involving a number of layers of nodes. The backpropagation gradient descent scheme, in its various 
formulations, was introduced as a popular algorithmic framework for training multilayer architectures. 
We also established some very important properties of multilayer neural networks, namely
their universal approximation property and their power to solve any classification task comprising
classes formed by the union of polyhedral regions in the input space. Two or three layers were,
theoretically, enough to perform such tasks. Thus, it seems that everything has been said. Unfortunately (or
maybe fortunately) this is far from the truth.

Following the mid-1980s, feed-forward neural networks, after almost one decade of intense re- 
search, lost their initial glory and were superseded, to a large extent, by other techniques, such as 
kernel-based schemes, boosting and boosted trees, and Bayesian approaches. A major reason for this 
loss of popularity was that their training can become difficult and often backpropagation-related 
algorithms exhibit a slow convergence speed, or they can converge to a “bad” local minimum, which 
was the general belief at the time. Although various “tricks” and techniques were proposed in order 
to improve convergence or find a better minimum, after training multiple times via different random 
initializations, still their generalization performance was not competitive compared to other methods. 
This drawback appeared to be more severe if more than two hidden layers were used. As a matter of 
fact, efforts to use more than two hidden layers were soon abandoned. 

In this section, we are going to discuss whether there is any need for deep networks that involve 
more than two hidden layers. Furthermore, we are going to offer arguments that challenge the view that 
the existence of poor local minima comprises a major obstacle while training deep neural networks. 

18.11.1 THE POWER OF DEEP ARCHITECTURES 

In Section 18.3, we discussed how each layer of a neural network provides a different representation of 
the input patterns. The input layer described each pattern as a point in the feature space. The first hidden 
layer of nodes formed a partition of the input space and placed the input point in one of the regions, 
using a coding scheme of zeros and ones (for the Heaviside activation) at the outputs of the respective 
neurons. This can be considered as a more abstract representation of the input patterns. The second 
hidden layer of nodes, based on the information provided by the previous layer, encoded information 
related to the classes; this is a further representation abstraction, which carries some type of “semantic 
meaning.” For example, it provides the information of whether a tumor is malignant or benign, in a 
related medical application. 

The previously reported hierarchical type of representation of the input patterns mimics the way 
in which a mammal's brain “senses” and “perceives” the surrounding world. The mammalian brain 
is organized in a number of layers of neurons, and each layer provides a different representation 
of the input percept. In this way, different levels of abstraction are formed, via a hierarchy of 
transformations. For example, in the primate visual system, this hierarchy involves detection of edges 
and primitive shapes, and, as we move to higher hierarchy levels, more complex visual shapes are 
formed, until finally a semantic concept is established; for example, a car moving in a video scene or a 
person sitting in an image. The cortex of our brain can be seen as a multilayer architecture with 5-10 
layers dedicated only to our visual system [212]. 

On the Representation Properties of Deep Networks 

Following the previous discussion, an issue that is now raised is whether one can obtain an input-output 
representation via a relatively simple functional formulation, such as the one implied by the support 
vector machines or via networks with less than three layers of neurons/processing elements, that is 
equivalent in performance, maybe at the expense of more elements per layer. 

The answer to the first point is yes, as long as the input-output dependence relation is simple 
enough. However, for more complex tasks, where more complex dependencies have to be learned, 
for example, recognition of a scene in a video recording or in language and speech recognition, the 
underlying functional dependence is of a very complex nature so that we are unable to express it 
analytically in a simple way. 

The answer to the second point, concerning shallow networks consisting of only a few layers, lies 
in what is known as compactness of representation. We say that a network, realizing an input-output 
functional dependence, is compact if it consists of relatively few free parameters (few computational 
elements) to be learned/tuned during the training phase. Thus, for a given number of training points, 
we expect compact representations to result in better generalization performance. 

It turns out that using networks with more layers, one can obtain more compact representations 
of the input-output relation. Although there are no theoretical findings for general learning tasks to 
prove such a claim, theoretical results from the theory of circuits of Boolean functions suggest that a 
function which can compactly be realized by, say, k layers of logic elements may need an exponentially 
large number of elements if it is realized via $k - 1$ layers. Some of these results have been generalized 
and are valid for learning algorithms in some special cases. For example, the parity function with $l$ 
inputs requires $O(2^l)$ training samples and parameters to be represented by a Gaussian support vector 
machine, $O(l^2)$ parameters for a neural network with one hidden layer, and $O(l)$ parameters and nodes 
for a multilayer network with $O(\log_2 l)$ layers (see, for example, [16,17,172]). 
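The flavor of such depth-versus-size results can be seen with a small Boolean-circuit sketch. The code below is an illustration only; the function names are hypothetical and the construction is not taken from the cited works. It computes the parity of $l$ bits with a balanced tree of two-input XOR gates, which needs $l - 1$ gates arranged in $\lceil \log_2 l \rceil$ layers, and counts the $2^{l-1}$ odd-parity input patterns that a two-level sum-of-products circuit has to enumerate as separate terms.

```python
from itertools import product
from math import ceil, log2


def parity_deep(bits):
    """Parity via a balanced tree of two-input XOR gates.

    Returns (parity, number_of_gates, number_of_layers).
    """
    layer = list(bits)
    gates = 0
    depth = 0
    while len(layer) > 1:
        nxt = []
        # Pair up neighbours; an unpaired last element is passed through.
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])
            gates += 1
        if len(layer) % 2 == 1:
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], gates, depth


def shallow_term_count(l):
    """Number of odd-parity input patterns, i.e. the product terms that a
    two-level sum-of-products circuit for parity has to enumerate."""
    return sum(1 for bits in product((0, 1), repeat=l) if sum(bits) % 2 == 1)


l = 8
value, gates, depth = parity_deep([1, 0, 1, 1, 0, 0, 1, 0])
print(value, gates, depth)                       # parity, gate count, layer count
print(gates == l - 1, depth == ceil(log2(l)))    # deep circuit: linear size, log depth
print(shallow_term_count(l) == 2 ** (l - 1))     # shallow circuit: exponential size
```

The deep circuit grows linearly with $l$, whereas the shallow enumeration doubles with every additional input bit.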

In [160,183], it is shown that for a special class of deep networks and target outputs, one needs a 
substantially smaller number of nodes to achieve a predefined accuracy compared to a shallow network. 
In [163], for networks employing ReLU, it has been shown that the composition of layers identifies 
linear regions in the input space, whose number has an exponential dependence on the depth of the 
network. In [46], it is shown that there is a simple function in $\mathbb{R}^l$, which is expressible by a small three-layer 
feed-forward neural network, while it cannot be adequately approximated by any two-layer network, 
unless the number of nodes is exponentially large with respect to the dimension. This result holds true 
for virtually all known activation functions, including the ReLU. Formally, these results demonstrate 
that depth, even if increased by one, can be exponentially more valuable than width (number of 
nodes per layer) for standard feed-forward neural networks. Results similar in spirit have been derived 
in [230]. In [34], the focus is on convolutional networks; employing arguments from tensor algebra, 
it is shown that besides a negligible (zero measure) set, all functions that can be realized by a deep 
network of polynomial size require exponential size in order to be realized, or even approximated, by 
a shallow network. 

In [146], the interplay between width (number of nodes in a layer) and depth is considered. The 
question posed is whether there are wide networks that cannot be realized by narrow networks whose 
size is not substantially larger. This is the opposite of the so-called “existence” path, that is, to 
find functions that are efficiently realizable with a certain depth but cannot be efficiently realized with 
shallower depths. It is shown that there exists a family of ReLU networks that cannot be approximated 
by narrower networks whose depth increase is no more than polynomial. The theoretical and 
the experimental evidence in the paper points out that depth may be more effective than width for the 
expressiveness of ReLU networks. Moreover, the paper raises a number of open problems. At the time 
the current edition is compiled, the topic constitutes an area of ongoing research. 

Arguments such as the preceding ones may seem a bit confusing to a newcomer in the field, because 
we have already stated that networks with two layers of nodes are universal approximators for a 
certain class of functions. However, this theorem does not say how one can achieve this in practice. For 
example, any continuous function can be approximated arbitrarily closely by a sum of monomials. 
Nevertheless, a huge number of monomials may be required, which is not practically feasible. In any 
learning task, we have to be concerned with what is feasibly “learnable” in a given representation. The 
interested reader may refer to, for example, [237] for a discussion on the benefits one is expected to get 
when using many-layer architectures. 

Let us now elaborate a bit more on the aforementioned issues and also make bridges to some of the 
techniques that have been discussed in previous chapters. Recall from Chapter 11 that nonparametric 
techniques, modeling the input-output relation in RKHSs, establish a functional dependence of the 
form 

$$f(\boldsymbol{x}) = \sum_{n=1}^{N} \theta_n \kappa(\boldsymbol{x}, \boldsymbol{x}_n). \tag{18.62}$$


This can be seen as a network with one hidden layer, whose processing nodes perform kernel 
computations and the output node performs a linear combination. As already commented in Section 11.10.4, 
the kernel function $\kappa(\boldsymbol{x}, \boldsymbol{x}_n)$ can be thought of as a measure of similarity between $\boldsymbol{x}$ and the respective 
training sample, $\boldsymbol{x}_n$. For kernels such as the Gaussian one, the action of the kernel function is of a local 
nature, in the sense that the contribution of $\kappa(\boldsymbol{x}, \boldsymbol{x}_n)$ in the summation tends to zero as the distance 
of $\boldsymbol{x}$ from $\boldsymbol{x}_n$ increases (the rate of decreasing influence depends on the variance $\sigma^2$ of the Gaussian). 
Thus, if the true input-output functional dependence undergoes fast variations, then a large number of 
such local kernels will be needed to model sufficiently well the input-output relation. This is natural, 
as one attempts to approximate a fast-changing function in terms of smooth bases of a local extent. 
Similar arguments hold true for the Gaussian processes discussed in Chapter 13. Besides the kernel 
methods, other widely used learning schemes are also of a local nature, as is the case for the decision 
trees, discussed in Chapter 7. This is because the input space is partitioned into regions via rules that 
are of a local nature. 
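The locality of Gaussian kernels can be checked numerically. The following is a minimal sketch under assumed values: the kernel expansion is written as $f(\boldsymbol{x}) = \sum_n \theta_n \kappa(\boldsymbol{x}, \boldsymbol{x}_n)$ without a bias term, and the centers, coefficients, and $\sigma$ are chosen arbitrarily for illustration. It prints the contribution of a single training point as the query moves away from it.

```python
import numpy as np


def gaussian_kernel(x, xn, sigma=1.0):
    """kappa(x, x_n) = exp(-||x - x_n||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(xn, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))


def kernel_expansion(x, centers, theta, sigma=1.0):
    """f(x) = sum_n theta_n * kappa(x, x_n): one hidden layer of kernel nodes
    followed by a linear combiner."""
    return sum(t * gaussian_kernel(x, c, sigma) for t, c in zip(theta, centers))


centers = [np.array([0.0]), np.array([5.0])]   # assumed training points
theta = [1.0, -2.0]                            # assumed coefficients

# Contribution of the first training point at increasing distances from it.
for d in (0.0, 1.0, 3.0, 6.0):
    print(d, gaussian_kernel([d], centers[0]))

# Near x = 0 the output is dominated by the first node alone.
print(kernel_expansion([0.0], centers, theta))
```

Because each node only "sees" a neighborhood of its own training point, covering a rapidly varying function requires many such nodes.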

In contrast, assuming that the above stated input-output dependence variations are not random in 
nature but that there exist underlying (unknown) regularities, resorting to models with a more compact 
representation, such as networks with many layers, one expects to learn the regularities and exploit 
them to improve the performance. As stated in [194], exploiting the regularities that are hidden in the 
training data is likely to aid in the design of an excellent predictor for future events. The interested 
reader may explore more on these issues from the insightful tutorial [18]. 

Distributed Representations 

A notable characteristic of multilayer neural networks is that they offer what is known in machine 
learning as distributed representation of the input patterns. Take, as an example, the simple case where 
the neuron outputs are either 1 or 0, as discussed in Section 18.3. Interpreting the output of each node 
as a feature, the vector comprising these feature values, in a layer, provides information with respect 
to the input patterns; this is a distributed representation that is spread among all the features in a layer. 
Moreover, these are not mutually exclusive. It turns out that such a distributed representation is sparse, 
because only a few of the neurons are active each time. This is in line with what we believe happens 
in the human brain, where at each time less than 5% of the neurons, in each layer, fire, and the rest 
remain inactive. The opposite extreme of such distributed representations would be to have a single 
neuron firing each time. 

In the case of neural networks with more general (compared to 0 and 1) activation functions, the 
features that are generated as outputs of the neurons in each one of the layers are shared by all patterns 
in all of the classes. For example, the same neurons are used and shared when learning to discriminate 
between, say, “airplanes” and “cars.” This makes a lot of sense, since many attributes are shared and 
are common to both classes. For example, the metal structure and the existence of wheels are common. 
In contrast, decision trees (Chapter 7) are not based on distributed representation. If a pattern from the 
“airplane” class is presented to the input, only the corresponding class leaf and the nodes on the path 
from the root to this leaf are activated. What underlies distributed representations is that they build a 
similarity space in which semantically close input patterns remain close in some “distance” sense. 

At the other extreme of representation is the one offered by local methods, where a different model 
is attached to each region in space and parameters are optimized locally. However, it turns out that 
distributed representations can be exponentially more compact, compared to local representations. Take 
as an example the representation of integers in the set $\{1, 2, \ldots, N\}$. One way is to use a vector of 
length $N$ and to set for each integer the respective position equal to 1. However, a more efficient way 
in terms of the number of bits would be to employ a distributed representation, that is, use a vector of 
size $\log_2 N$ and encode each integer via ones and zeros positioned to express the number as a sum of 
powers of two. An early discussion on the benefits of distributed representation in learning tasks can 
be found in [82]. A more detailed treatment of these issues is provided in [18]. 
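The integer example can be written out in a few lines. This is a sketch only, with hypothetical function names; it assumes that the distributed code stores $n - 1$ in $\lceil \log_2 N \rceil$ bits so that all $N$ integers fit.

```python
from math import ceil, log2


def one_hot(n, N):
    """Local code: a length-N vector with a single 1 at position n (1-based)."""
    return [1 if i == n else 0 for i in range(1, N + 1)]


def binary_code(n, N):
    """Distributed code: n - 1 written with ceil(log2 N) bits,
    most significant bit first."""
    width = max(1, ceil(log2(N)))
    return [((n - 1) >> k) & 1 for k in reversed(range(width))]


N = 16
print(len(one_hot(11, N)), one_hot(11, N))          # 16 positions, one active
print(len(binary_code(11, N)), binary_code(11, N))  # 4 positions, several active
```

In the local code each position is dedicated to one integer, while in the distributed code every position is shared by many integers and the information is carried by the pattern as a whole.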

On the Optimization of Deep Networks: Some Theoretical Highlights 

In the beginning of this section, it was stated that dealing with deep networks in the 1980s and 1990s 
was abandoned due to the difficulty in their training. The general belief at the time was that because 
the number of the parameters becomes very large, the cost function in the parameter space becomes 
complicated and the probability of getting stuck in a local minimum significantly increases. 

The above belief has been seriously challenged after 2010. At that time, it was discovered that one 
can train large networks, provided that enough training data were used. It was around this time that large 
data sets were built and could be used for training in parallel with the advances in computer technology 
that offered the necessary computational power. As a matter of fact, this is the crucial factor for the 
comeback of the multilayer feed-forward neural networks, that is, computer technology together with 
the availability of large data sets. Of course, a number of techniques that in the meantime have been 
developed, such as the use of ReLU nonlinearity and the dropout method for regularization, combined 
with experience gained for the associated training, also have their share in the popularity and success 
of neural networks; however, these advances had a rather secondary contribution. 

As a consequence, the success of training big networks raised questions on issues related to local 
minima. Even if one has huge amounts of data, if the terrain of the cost function in the parameter space 
is “full” of local minima, then once the algorithm has been trapped in one of those, keeping training 
with more data would not make much sense. 

The previous setting ignited related research activity, and many interesting findings have challenged 
the previous belief. At the time this edition is compiled, this is still a hot topic of research. 
Our goal here is not to provide an extensive related discussion, but to give some basic directions and 
insights and alert the reader to the issue. For example, in [38], it is argued that a more profound 
difficulty, especially in high-dimensional problems, originates from the proliferation of saddle points. 
The existence of such points can slow down the convergence of the training algorithm dramatically. In 
[62], it is argued that the effect of the large number of saddle points on the gradient descent is unclear, 
since the value of the gradient can be very small and slow down the convergence rate, yet it seems that 
the algorithm can escape such critical points. In [35], it is claimed that in large-size networks, most of 
the local minima yield low cost values and result in similar performance on a test set. Moreover, the 
probability of finding a poor (high cost value) local minimum decreases fast as the network size 
increases. Both papers borrow results from statistical physics on Gaussian random fields. A result similar 
in essence, yet via a different mathematical path, is derived in [116]. It is shown that for the square loss 
function and deep linear neural networks, every local minimum is also a global one. Also, every critical 
point that is not a global minimum is necessarily a saddle point. The results are extended to nonlinear 
networks under certain independence assumptions. In a more recent paper [45], the issue is answered 
via the convergence of the gradient descent algorithm. For the case of the squared error loss function, 
it is shown that gradient descent finds a global minimum in training deep neural networks; this is in 
spite of the fact that the cost function is a nonconvex one with respect to the involved parameters. 
It is shown that the gradient descent achieves zero training loss in polynomial time. The paper deals 
with three different types of overparameterized neural networks: a fully connected, a convolutional, 
and a residual deep network, which will be discussed in Section 18.12. In [122], the convergence of 
the stochastic gradient descent is studied and it is argued that the algorithm will not get stuck at local 
minima with small diameters, as long as the neighborhoods of these regions contain enough gradient 
information. The neighborhood size is controlled by the step size and the gradient noise. The case of 
saddle points and how one can escape from them is treated in, e.g., [111,171]. 
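To give a feel for why saddle points can slow gradient descent down without necessarily trapping it, here is a toy sketch. It is not taken from the cited papers; the cost $J(\theta_1, \theta_2) = \theta_1^2 - \theta_2^2$, the step size, and the starting points are illustrative assumptions. The toy cost is unbounded below, so "escape" simply means leaving a neighborhood of the saddle at the origin.

```python
import numpy as np


def grad(theta):
    """Gradient of the toy cost J = theta_1^2 - theta_2^2 (saddle at the origin)."""
    return np.array([2.0 * theta[0], -2.0 * theta[1]])


def steps_to_escape(theta0, mu=0.05, max_iter=10_000, radius=1.0):
    """Run gradient descent; return the iteration at which |theta_2| exceeds
    `radius`, or None if that never happens within max_iter steps."""
    theta = np.array(theta0, dtype=float)
    for it in range(1, max_iter + 1):
        theta = theta - mu * grad(theta)
        if abs(theta[1]) > radius:
            return it
    return None


print(steps_to_escape([1.0, 0.0]))     # exactly on the attracting direction
print(steps_to_escape([1.0, 1e-3]))    # small perturbation off that direction
print(steps_to_escape([1.0, 1e-9]))    # smaller perturbation: longer plateau
```

Started exactly on the axis $\theta_2 = 0$, the iterates converge to the saddle. Any nonzero component along the descending direction grows geometrically, so the algorithm eventually leaves, but the smaller the initial component (and the gradient near the saddle), the longer the plateau lasts.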

On the Generalization Power of Deep Networks 

In the previous paragraph, some issues related to the convergence of the training algorithm and the 
nonconvexity of the optimizing task were briefly presented. The focus of the discussion now turns to 
issues related to the generalization performance of deep networks. This is still an open problem, and at 
the time this edition is compiled the topic is a very active area of research. 

The generalization performance of any learner, including of course deep neural networks, is quan- 
tified by the difference between the training error and the test error. Good learners are those where 
the test error and the training error have close values. Also, as we have discussed in Chapter 3, if the 
number of parameters is large enough with respect to the size of the training data, overfitting occurs 
and we expect the test error to deviate from the training one. However, in the case of deep networks we 
are faced with a “paradox.” 

When training very large overparameterized networks, where the number of parameters is larger than the 
size of the training set, the resulting network often (though not always) exhibits good generalization 
performance, even without regularization! In [256], it is shown that deep neural networks easily fit random 
labels. In other words, the effective capacity of a neural network is large enough for a brute-force 
memorization of the entire data set. Furthermore, it is pointed out that in contrast with classical convex 
empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, in 
the world of deep neural networks regularization seems to help improve the final test error of a model, 
yet its absence does not necessarily lead to poor generalization performance. 

An explanation for this phenomenon has been attempted via the implicit regularization imposed 
by the gradient descent algorithm. Take for example the least-squares and a linear model. We know 
(e.g., Sections 6.4 and 9.5) that when a system is underdetermined the least-squares solution is the 
minimum norm one. Similar arguments hold for other cost functions and models, when the gradient 
descent algorithm is used (e.g., [73]) or if its stochastic gradient version is employed (see, e.g., [184]). 
As it is stated in [219], there are many global minima of the training objective, most of which will 
not generalize well, but the optimization algorithm (e.g., gradient descent) biases the solution toward a 
particular minimum that does generalize well. The effect of the batch size on the generalization ability 
is studied experimentally in [119], where it is pointed out that small batch sizes have a beneficial effect. 
Some of the arguments posed in this paper are challenged in [44]. 
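The least-squares case mentioned above is easy to verify numerically. The sketch below uses assumed values (random data, problem sizes, iteration count, and zero initialization are all illustrative choices). Gradient descent is run on an underdetermined linear least-squares problem and the result is compared with the minimum-norm solution obtained via the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 5, 20                         # fewer equations than unknowns
X = rng.standard_normal((N, l))
y = rng.standard_normal(N)

# A step size of 1 / sigma_max^2 is small enough for this quadratic cost.
mu = 1.0 / np.linalg.norm(X, 2) ** 2

theta = np.zeros(l)                  # zero initialization matters here
for _ in range(5000):
    theta -= mu * X.T @ (X @ theta - y)   # gradient of 0.5 * ||y - X theta||^2

theta_min_norm = np.linalg.pinv(X) @ y    # minimum-norm interpolating solution

print(np.linalg.norm(X @ theta - y))           # training error, close to zero
print(np.linalg.norm(theta - theta_min_norm))  # gap to the minimum-norm solution
```

Starting from zero, every update is a combination of the rows of `X`, so the iterates never leave the row space, and the only interpolating solution in that subspace is the minimum-norm one. No explicit regularizer appears in the cost; the bias comes from the algorithm and its initialization.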

In [14], the generalization power of overparameterized networks, where the number of the unknown 
parameters is larger than that of the training data samples, is discussed in the context of 
function smoothness and small norm solutions. The classical U-shaped graph of the test error versus 
the model's complexity, as discussed in Section 3.13, is reconsidered and it is pointed out that for 
overparameterized networks the test error can be made to decrease even if the training error tends to zero. 

As said in the beginning, this is a new field of research, and a number of questions and issues are 
still open, with different points of view contributing to the discussion (see, e.g., [117]). 


18.12 CONVOLUTIONAL NEURAL NETWORKS 

In our discussion so far, we have assumed that neural networks are fed to their input layer with feature 
vectors. This is in line with any other classifier/predictor that has been discussed in previous chapters. 
Feature vectors are generated from the raw data in an effort to compact information that is relevant to 
the machine learning task at hand. To this end, for many years, a large pool of techniques has been 
developed to fit the nature of the data in different applications (see, e.g., [231]). 

An alternative breakthrough came in the late 1980s, when the feature generation phase was integrated 
as part of the training of a neural network. The idea was to learn the features from the data 
together with the parameters of the neural network and not independently. Such networks were called 
convolutional neural networks (CNNs) and their success was first demonstrated in the OCR task of 
recognizing digits [129]. The name “convolutional” comes from the fact that the first layers 
in a neural network perform convolutions instead of inner products, which are the basic operations 
in the fully connected networks presented in Section 18.3.1. 

18.12.1 THE NEED FOR CONVOLUTIONS 

Let us first make sure that we understand the reason why we cannot, in practice, feed the input of 
a neural network directly with raw data, e.g., an image array or the samples of a digitized version of 
a speech segment, and why preprocessing is necessary in order to generate features. This is also true 
for any predictor/learner and not for neural networks only. We have commented on this issue in the 
introductory chapter, but there it was too early to grasp the true need. In many applications, working 
directly with raw data makes the task simply unmanageable. 

Let us take, as an example, the case of a 256 × 256 image array. Vectorizing it results in an input 
vector $\boldsymbol{x} \in \mathbb{R}^l$, where $l \approx 65000$. Assume that the number of nodes of the first layer is $k_1 = 1000$. Then 
the number of the involved parameters, $\theta_{jk}$, $j = 1, 2, \ldots, 65000$, $k = 1, 2, \ldots, 1000$, connecting all the 
input nodes to all the nodes of the first layer in a fully connected network, would be of the order 
of 65 million. This number explodes further if the input image is a high-resolution one with pixels 
of the order of 1000 × 1000. Furthermore, this number increases if we deal with, for example, color 
images where the dimensionality of the input is multiplied by three for an RGB (red-green-blue) color 
representation scheme. Moreover, as one adds more hidden layers, the number of parameters keeps 
increasing. Besides the associated computational load issues, we know that training networks with a 
large number of parameters seriously challenges their generalization performance. Such networks would 
require a huge amount of training data in order to cope with the overfitting tendency. 
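The arithmetic behind these figures can be reproduced with a one-line helper. This is a sketch only: the function name is hypothetical, bias terms are not counted, and the 1000-node first layer is carried over from the example above to the larger images as an assumption.

```python
def first_layer_weights(height, width, channels, hidden_nodes):
    """Number of weights connecting a vectorized image to a fully
    connected first layer (bias terms not counted)."""
    return height * width * channels * hidden_nodes


print(first_layer_weights(256, 256, 1, 1000))     # 65536000  (about 65 million)
print(first_layer_weights(1000, 1000, 1, 1000))   # 1000000000
print(first_layer_weights(1000, 1000, 3, 1000))   # 3000000000 (RGB input)
```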

Besides the explosion in the number of parameters, vectorizing an image array leads to a loss of 
information; this is because we throw away important information with respect to how the pixels are 
interrelated within an area in an image. As a matter of fact, the goal of the various feature generation 
techniques that have been developed over the years is exactly that, namely, to extract information that 
quantifies correlations or other statistical dependencies that relate pixel values within the image. In 
this way, one can efficiently “encode” learning-related information that resides in the raw data. 

By employing convolutions, one can simultaneously tackle both issues, i.e., that of the parameter 
explosion as well as the extraction of useful statistical information. The basic steps involved in any 
convolutional network are: 

• the convolution step, 

• the nonlinearity step, 

• the pooling step. 

The Convolution Step 

One way to reduce the number of parameters is via weight sharing, as has been briefly discussed at the 
end of Section 18.3.1. We will now borrow this idea of weight sharing and use it in a more sophisticated 
fashion. To this end, let us focus on the case where the input to the network comprises images. The 
input image array is denoted as $I$. For the case of Fig. 18.17A, this is a 3 × 3 array; note that the input is 
not vectorized. To stress that we will depart from the rationale of the multiply-add (inner product) 
operations of the fully connected feed-forward networks, we will use a different symbol, $h$, instead 
of $\theta$, to denote the associated parameters. Let us now introduce the concept of weight sharing. Recall 
that in a fully connected network, each node is associated with a vector of parameters, $\boldsymbol{\theta}_j$, for the $j$th 
node, whose dimensionality is equal to the number of nodes of the previous layer. In contrast, now, 
each node will be associated with a single parameter. To this end, arrange the nodes in the form of a 
two-dimensional array, as shown in Fig. 18.17B. For the case of the figure, we have assumed four nodes 
arranged in a 2 × 2 array, $H$. The first node is characterized by $h(1, 1)$, the second by $h(1, 2)$, and so 
on. In other words, any connection that ends at the first node will be multiplied by the same weight, 
$h(1, 1)$, and a similar argument holds true for the rest of the nodes. This rationale reduces the number 
of parameters dramatically. However, in order for this to make sense, we have to move away from the 
inner product operations rationale of the fully connected networks. Indeed, to understand why, let us 
assume that we use a single parameter per node in a fully connected network. Then the output of the 
linear combiner associated with the first node would be $O(1, 1) = h(1, 1)a$, where $a$ is the sum of all 
the inputs to the node received from the previous layer. The respective output of the second node would 
be $O(1, 2) = h(1, 2)a$, and so on. Hence, all nodes would, basically, provide the same information with 
respect to the input values; the only difference would be the different weights acting upon the same 
input information from the previous layer. 

FIGURE 18.17 

(A) A 3 × 3 input image array. (B) The “nodes” of the hidden layer can be thought of as elements of a 
two-dimensional array. Each node corresponds to a single parameter that is associated with the corresponding element 
of the array, $H$. In this case, the corresponding array is of size 2 × 2. To perform convolutions, one slides matrix $H$ 
over matrix $I$. For the specific setup of the figure, four different positions are possible, indicated in (A) by different 
colors and/or types of lines. 

Let us now introduce a different concept, where we keep a single parameter per node, yet each one 
of the outputs of a hidden layer conveys different information with respect to the various inputs that 
are received from the previous layer. To this end, we will introduce convolutions. In this context, the 
nodes of the hidden layer are interpreted as elements of an array $H$, and we convolve $H$ with the input 
array I. The first output value of the hidden layer will be 


$$O(1, 1) = h(1, 1)I(1, 1) + h(1, 2)I(1, 2) + h(2, 1)I(2, 1) + h(2, 2)I(2, 2).$$


The above result is obtained if we place the 2 × 2 matrix $H$ on top of $I$, starting at the top left corner (the 
full red square in Fig. 18.17A indicates the position of the $H$ matrix). Then, we multiply corresponding 
elements in the overlapping parts of the two matrices and add them together. From a physical point of 
view, the resulting $O(1, 1)$ value is a weighted average over a local area within the $I$ array. In the 
previous operation, the corresponding area of the image consists of the pixels within the 2 × 2 top left 
part of the array $I$. To obtain the second output value, $O(1, 2)$, we slide $H$ one pixel to the right, as 
indicated by the dotted red square box, and repeat the operations, i.e., 


$$O(1, 2) = h(1, 1)I(1, 2) + h(1, 2)I(1, 3) + h(2, 1)I(2, 2) + h(2, 2)I(2, 3).$$


Following the same rationale, we slide $H$ so as to “scan” the whole image array; thus, two more output 
values are obtained, i.e., $O(2, 1)$ and $O(2, 2)$. The four possible positions of $H$ on top of $I$ are indicated 
in Fig. 18.17A by the full-red, dotted red, dark gray, and dotted gray square boxes. For each position, 
one output value is obtained. Hence, under the previously described scenario, the outputs of the first 
hidden layer form a 2 × 2 array, $O$. Each one of the elements of the output array encodes information 
from a different area of the input image. 

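In the more general setting, the convolution operation between two matrices, $H \in \mathbb{R}^{m \times m}$ and 
$I \in \mathbb{R}^{l \times l}$, is another matrix, defined as 

$$O(i, j) = \sum_{t=1}^{m} \sum_{r=1}^{m} h(t, r)\, I(i + t - 1,\, j + r - 1), \tag{18.63}$$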


where for our case, $m < l$. In words, $O(i, j)$ contains information in a window area of the input array. 
According to the definition in Eq. (18.63), the element $I(i, j)$ is the top left element in this window 
area. The size of the window depends on the value of $m$. The size of the output matrix depends on 
the assumptions that one adopts on how to deal with the elements/pixels at the borders of $I$. We will 
come back to that soon. Strictly speaking, in the signal processing jargon, Eq. (18.63) is known as the 
cross-correlation operation. For the convolution operation, as has already been defined in Eq. (4.49), 
a flipping of the indices has first to take place.⁴ However, “convolution” is the name that has “survived” in the 
machine learning community and we will adhere to it. After all, both operations perform a weighted 
averaging over the pixels within a window area of an image. 
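The sliding-window operation of Eq. (18.63) and its relation to the signal-processing convolution can be sketched as follows. The code is illustrative only: it assumes that the window is kept entirely inside the input (no border padding, one-pixel shifts), it uses 0-based indices, and the function names, the 3 × 3 input, and the 2 × 2 filter values are hypothetical.

```python
import numpy as np


def cross_correlate(I, H):
    """O(i, j) = sum_t sum_r h(t, r) I(i + t - 1, j + r - 1), with the window
    kept entirely inside I (no padding, stride 1). Indices are 0-based here."""
    I, H = np.asarray(I, dtype=float), np.asarray(H, dtype=float)
    m_rows, m_cols = H.shape
    out_rows = I.shape[0] - m_rows + 1
    out_cols = I.shape[1] - m_cols + 1
    O = np.empty((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            # Multiply overlapping elements and add them together.
            O[i, j] = np.sum(H * I[i:i + m_rows, j:j + m_cols])
    return O


def convolve(I, H):
    """Signal-processing convolution: flip H along both axes, then correlate."""
    return cross_correlate(I, np.flip(np.asarray(H)))


I = np.arange(1, 10).reshape(3, 3)      # assumed 3 x 3 input, values 1..9
H = np.array([[1, 0], [0, -1]])         # assumed 2 x 2 filter

print(cross_correlate(I, H))            # 2 x 2 output: four window positions
print(convolve(I, H))                   # same magnitudes, opposite sign here
```

For a filter whose values are learned from data, the flip makes no practical difference, since the training simply learns the flipped values.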

The previous discussion “forces” us to cease thinking of a hidden layer as a collection of nodes one 
next to the other. Instead, in a CNN, each hidden layer corresponds to a (or more than one, as we will 
soon see) matrix H. Furthermore, H is used to perform convolutions. From a signal processing point 
of view, this matrix is a filter, which acts upon the input to provide the output. In the machine learning 
jargon, it is also called the kernel matrix instead of a filter. The output matrix is usually referred to as 
the feature map array. 

In summary, by performing convolutions, instead of inner product operations, we have achieved 
what was our original goal: (a) the parameters comprising the hidden layer are shared by all the input 
pixels and we do not have a dedicated set of parameters per input element (pixel), and (b) the outputs of 
the hidden layer encode local neighborhood correlation information from the various areas within the 
input image. Also, since the output of the hidden layer is also an image array, one can consider it as the 
input to a second hidden layer and in this way build a network with many layers, each one performing 
convolutions. 



FIGURE 18.18

(A) The original image and (B) the image edges extracted after filtering the original image array with the filter
matrix, H, in Eq. (18.64).


Footnote 4: Also, here, we subtract 1, since we start counting the indices from 1 and not from 0.

As a matter of fact, such filtering operations have traditionally been used in order to generate features
from images. The difference was that the elements of the filter matrix were preselected. Take, as
an example, the following matrix:

$$H = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}. \tag{18.64}$$


The above filter is known as an edge detector. Convolving an image array, I, with the previous matrix, H,
detects the edges in an image. Fig. 18.18A shows the boat image and Fig. 18.18B shows the output after
filtering the image on the left with the filter matrix H above. Detecting edges is of major importance
in image understanding. Moreover, by changing appropriately the values in H, one can detect edges in
different orientations, e.g., diagonal, vertical, horizontal; in other words, by changing the values of H one can
generate different types of features. We have come closer to the idea behind CNNs.
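
As an illustration of the edge detector at work, the following sketch applies the matrix of Eq. (18.64) to a small synthetic image. The 6 x 6 test image (dark left half, bright right half) is a hypothetical stand-in for the boat image, and no padding with a stride of one is assumed.

```python
import numpy as np

# Edge detector of Eq. (18.64)
H = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=float)

# Hypothetical 6 x 6 test image: dark left half (0), bright right half (1)
I = np.zeros((6, 6))
I[:, 3:] = 1.0

m = H.shape[0]
k = I.shape[0] - m + 1          # output size without padding, stride 1
O = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        O[i, j] = np.sum(H * I[i:i + m, j:j + m])

print(O)   # every row equals [0, -3, 3, 0]
```

Since the elements of H sum to zero, the output vanishes wherever the window covers a region of constant intensity; nonzero values appear only where the window straddles the vertical edge.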

• Instead of using a fixed filter/kernel matrix, as in the edge detector example, leave the computation 
of the values of the filter matrix, H, to the training phase. In other words, we make H data-adaptive 
and not preselected. 

• Instead of using a single filter matrix, employ more than one. Each of them will generate a different
type of features. For example, one may generate diagonal edges, another one horizontal ones, etc.
Hence, each hidden layer will comprise more than one filter matrix. The values of the elements
of each one of the filter matrices will be computed during the training phase, by optimizing some
criterion. In other words, each hidden layer of a CNN generates a set of features in an optimal way.





FIGURE 18.19 

Each pixel in a feature map corresponds to a specific area in the input image, which is known as the receptive field 
of the corresponding pixel. In this figure, three filters/kernels are used. The number of filters is known as depth or 
sometimes we refer to it as the number of channels. Hence, in this case, the depth of the hidden layer is three. 

Fig. 18.19 illustrates the input and the first hidden layer of a CNN. The input comprises an image
array. The hidden layer consists of three filter matrices, namely, H_1, H_2, H_3. Observe that each feature
map array is the result of sliding (convolving) a different filter matrix over the input image. The more
filters are employed, the more feature maps are extracted and, in principle, the better the network
should perform. However, the more filters we use, the more parameters have to be learned, raising
computational as well as overfitting issues. Note that each pixel in an output feature map array encodes
information within the window area that is defined by the corresponding position of the respective filter
matrix.

An important characteristic of a CNN is that translation invariance is naturally built into the network
and it is a by-product of the involved convolutions. Indeed, the latter are performed by sliding
the same filter matrix over the entire image. Thus, if an object, which is present in an image, is placed
in another position, the only difference would be that the contribution of this object to the output will
also move by the same number of pixels.
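
The shift property just described can be checked numerically. The sketch below is only an illustration under stated assumptions: a hypothetical 8 x 8 image containing a 2 x 2 bright "object", a random 3 x 3 filter, no padding, and a stride of one. The object is moved by two rows and three columns and the two feature maps are compared.

```python
import numpy as np

def cross_correlate2d(I, H):
    """Sliding-window weighted sum of Eq. (18.63), 0-based, no padding, stride 1."""
    m = H.shape[0]
    k = I.shape[0] - m + 1
    return np.array([[np.sum(H * I[i:i + m, j:j + m]) for j in range(k)]
                     for i in range(k)])

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3))   # hypothetical filter

I1 = np.zeros((8, 8))
I1[1:3, 1:3] = 1.0                # object in its original position
I2 = np.zeros((8, 8))
I2[3:5, 4:6] = 1.0                # same object, moved 2 rows down and 3 columns right

O1 = cross_correlate2d(I1, H)
O2 = cross_correlate2d(I2, H)

# The response to the object is displaced by the same (2, 3) offset
print(np.allclose(O2[2:, 3:], O1[:4, :3]))   # True
```

The comparison is restricted to the part of the feature map for which both the original and the shifted window positions lie inside the image.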

It is interesting to note that there is strong evidence from the visual neuroscience field that similar 
computations are performed in the human brain (e.g., [102,212]). The notion of convolutions was first 
used in [53] in the context of unsupervised learning. 

Below, we provide some jargon terms used in conjunction with CNNs. 


• Depth: The depth of a layer is the number of filter matrices that are employed in this layer. This is 
not to be confused with the depth of the network, which corresponds to the total number of hidden 
layers used. Sometimes, we refer to the number of filters as the number of channels. 

• Receptive field: Each pixel in an output feature map array results as a weighted average of the pixels 
within a specific area of the input (or of the output of the previous layer) image array. The specific 
area that corresponds to a pixel is known as its receptive field (see Fig. 18.19). 
FIGURE 18.20

The figure presents the case of an input matrix of size 5 x 5 and of a filter matrix 3 x 3. In (A), the stride is equal to
s = 1 and in (B) it is equal to s = 2. In (A) the output is a matrix of size 3 x 3 and in (B) of size 2 x 2.

• Stride: In practice, instead of sliding the filter matrix one pixel at a time, one can slide it by, say, s 
pixels. This value is known as the stride. For values of s > 1, feature map arrays that are smaller in 
size result. This is illustrated in Figs. 18.20A and B. 

• Zero padding: Sometimes, zeros are used to pad the input matrix around the border pixels. In this
way, the dimension of the matrix increases. If the original matrix has dimensions l x l, after expanding
it with p columns and rows on each side, the new dimensions become (l + 2p) x (l + 2p). This is shown in
Fig. 18.21.
FIGURE 18.21

An example where the original matrix is 5 x 5 and after padding with p = 2 rows and columns, its size becomes
9 x 9.


• Bias term: After each convolution operation that generates a feature map pixel, a bias term, b, is
added. The value of this term is also computed during training. Note that a common bias term is
used for all the pixels in the same feature map. This is in line with the weight sharing rationale; in
the same way that all parameters of a filter matrix are shared by all the input image array pixels, the
same bias term is used for all pixel locations.

One can adjust the size of an output feature map array by adjusting the value of the stride, s, and the 
number of extra zero columns and rows in the padding. In general, it can easily be checked that if 
$I \in \mathbb{R}^{l \times l}$, $H \in \mathbb{R}^{m \times m}$, s is the stride, and p is the number of extra rows and columns for padding, then
the feature map has dimensions k x k, where

$$k = \left\lfloor \frac{l + 2p - m}{s} \right\rfloor + 1, \tag{18.65}$$

and $\lfloor \cdot \rfloor$ denotes the floor operation, e.g., $\lfloor 3.7 \rfloor = 3$. For example, if l = 5, m = 3, p = 0, and s = 1, then
k = 3. On the other hand, if l = 5, m = 3, p = 0, and s = 2, then k = 2 (see Figs. 18.20A and B).

Note that if the values of l, m, p, and s are such that the filter matrix, as it slides over I, falls
outside I, such operations are not performed. We only perform operations as long as the filter matrix
is contained within I.
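
The size formula of Eq. (18.65) and the role of the stride and the padding can be sketched in a few lines of NumPy. The function names are hypothetical, square arrays are assumed, and the loop only visits window positions that are fully contained within the (padded) input, in line with the remark above.

```python
import numpy as np

def output_size(l, m, p=0, s=1):
    """Eq. (18.65): k = floor((l + 2p - m) / s) + 1."""
    return (l + 2 * p - m) // s + 1

def conv2d(I, H, p=0, s=1):
    """Sliding-window weighted sum with zero padding p and stride s (square arrays assumed)."""
    I_pad = np.pad(I, p, mode="constant")   # p zero rows/columns on every side
    m = H.shape[0]
    k = output_size(I.shape[0], m, p, s)
    O = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # top left corner of the window moves in steps of s pixels
            O[i, j] = np.sum(H * I_pad[i * s:i * s + m, j * s:j * s + m])
    return O

print(output_size(5, 3, p=0, s=1))   # 3, as in Fig. 18.20A
print(output_size(5, 3, p=0, s=2))   # 2, as in Fig. 18.20B

I = np.ones((5, 5))
H = np.ones((3, 3))
print(conv2d(I, H, p=0, s=2).shape)  # (2, 2)
print(conv2d(I, H, p=1, s=1).shape)  # (5, 5)
```

With an all-ones input and filter, each output value simply counts how many nonzero input pixels fall inside the window, which makes the effect of the zero border easy to inspect.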

One may wonder why one has to pad with zeros. Note that by the way convolutions are performed,
the size of the feature map is smaller than that of the input array. As we will soon see, in a deep
network, the output feature map is used as input to the next layer. Thus, the size of the arrays would be
decreasing as we move on deeper into the network. In contrast, after padding with zeros, as Eq. (18.65)
suggests, we can control the size of the involved arrays. For example, if l = 5, m = 3, s = 1, and p = 1,
then the output will have the same size k = 5 as the input array. As a matter of fact, if p = (m - 1)/2,
for an odd value of m and s = 1, then k = l. In such cases, we call the operation same convolution. The other reason
that padding may be used is that the border pixels of the input image contribute less to the output,
compared to the pixels that are located in the interior of the image. Take as an example Fig. 18.17.
Pixel I(1, 1) contributes only to O(1, 1). In contrast, pixel I(2, 2) contributes to all output elements,
since it is contained within the window area, in all four positions. Using padding with zeros, we give
the chance to the border elements to have a more equal “say” to the output values, when compared with
the interior pixels.
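
The unequal "say" of border and interior pixels can be quantified by counting, for each input pixel, the number of window positions that contain it. The sketch below assumes the setting of Fig. 18.17 as described in the text (a 3 x 3 input, a 2 x 2 filter, and a stride of one); the helper name is hypothetical.

```python
import numpy as np

def contribution_counts(l, m, p=0):
    """For each pixel of an l x l input, count the window positions (stride 1) that contain it."""
    size = l + 2 * p
    counts = np.zeros((size, size), dtype=int)
    k = size - m + 1                     # number of window positions per dimension
    for i in range(k):
        for j in range(k):
            counts[i:i + m, j:j + m] += 1
    return counts[p:p + l, p:p + l]      # keep only the original (unpadded) pixels

print(contribution_counts(3, 2, p=0))
# [[1 2 1]
#  [2 4 2]
#  [1 2 1]]
print(contribution_counts(3, 2, p=1))
# [[4 4 4]
#  [4 4 4]
#  [4 4 4]]
```

Without padding, the corner pixel contributes to a single output element and the central one to four; with p = 1, every original pixel is covered by four window positions in this particular example.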
FIGURE 18.22 

(A) The image where the edges have been extracted and (B) the resulting image after applying the ReLU on each 
one of the pixels. Note that after filtering, the image array in (A) may involve negative values, which are set to zero 
after the ReLU activation. 


The Nonlinearity Step 

Once convolutions have been performed and the bias term has been added to all feature map values, the 
next step is to apply a nonlinearity (activation function) to each one of the pixels of every feature map 
array. Any one of the nonlinearities that have been previously discussed can be employed. Currently, 
the rectified linear activation function, ReLU, seems to be the most popular one. 

Fig. 18.22A shows the image obtained after filtering the original boat image with the edge detector 
filter in Eq. (18.64) and Fig. 18.22B shows the result that is obtained after the application of the ReLU 
nonlinearity on each individual pixel. 
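
In code, the nonlinearity step amounts to an elementwise operation on the feature map. The following minimal sketch uses a hypothetical 2 x 4 filter output with both negative and positive values, and a hypothetical bias value; neither comes from the boat image example.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied to each pixel independently."""
    return np.maximum(x, 0.0)

# Hypothetical filter output containing negative values
O = np.array([[0.0, -3.0, 3.0, 0.0],
              [0.0, -3.0, 3.0, 0.0]])
b = 0.5                      # hypothetical bias, common to all pixels of the feature map

A = relu(O + b)
print(A)   # [[0.5 0.  3.5 0.5] [0.5 0.  3.5 0.5]]
```

All negative values are set to zero, while positive values pass through unchanged.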

The Pooling Step 

The purpose of this step is to reduce the dimensionality of each feature map array. Sometimes, the 
step is also referred to as spatial pooling. To this end, one defines a window and slides it over the 
corresponding matrix. The sliding can be done by adopting a value for the respective stride parameter, s.

FIGURE 18.23

(A) The original matrix is of size 6 x 6. For pooling by a 2 x 2 window and stride s = 2, there are nine possible
locations of the window. These locations are indicated by using different colors to show the elements that are
grouped together in each one of the window locations. The maximum value per window location is indicated in
bold. (B) The resulting 3 x 3 matrix after max-pooling.

The pooling operation consists of choosing a single value to represent all the pixels that lie within
the window. The most commonly used operation is max pooling; that is, among all the pixels that
lie within the window, the one with the maximum value is selected. Another possibility is average
pooling, where the average value of all the pixels is selected; sometimes, this is known as sum pooling.
The pooling operation is illustrated in Fig. 18.23. The original image array is 6 x 6 and the window is
of size 2 x 2. We have chosen the stride to be equal to s = 2. That is, every time the window slides two
pixels to the right or to the bottom. Different colors have been used to indicate the various positions
of the window. Within each window, the maximum value is selected. The resulting matrix is of size
3 x 3. The same formula as in Eq. (18.65) can be used to compute the size of the resulting matrix in
the general case. Thus, the effect of pooling is to reduce (via downsampling) the dimensionality and
make the size of the arrays smaller. This is important because the output of each layer is presented as
the input to the next one. Hence, controlling the size of the arrays is of paramount importance, in order
to control the number of the involved parameters. Of course, the reduction in size should be done in
such a way so that the loss of information is as small as possible.
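
The pooling step can be sketched as follows, assuming a square array and no padding. The function name and the 6 x 6 test matrix (simply the integers 0 to 35) are hypothetical; the window and stride match the 2 x 2, s = 2 setting discussed above.

```python
import numpy as np

def pool2d(X, window=2, s=2, mode="max"):
    """Slide a window over X with stride s and keep one value (max or average) per position."""
    k = (X.shape[0] - window) // s + 1      # Eq. (18.65) with p = 0 and m = window
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            out[i, j] = reduce_fn(X[i * s:i * s + window, j * s:j * s + window])
    return out

X = np.arange(36, dtype=float).reshape(6, 6)   # hypothetical 6 x 6 array
P_max = pool2d(X, window=2, s=2, mode="max")
P_avg = pool2d(X, window=2, s=2, mode="avg")
print(P_max)   # [[ 7.  9. 11.] [19. 21. 23.] [31. 33. 35.]]
print(P_avg)   # [[ 3.5  5.5  7.5] [15.5 17.5 19.5] [27.5 29.5 31.5]]
```

Both variants map the 6 x 6 array to a 3 x 3 one; they differ only in the statistic that represents each window.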



FIGURE 18.24 

(A) The edges of the boat image after the application of the ReLU and (B) the resulting image after applying max-pooling
using an 8 x 8 window. Although the resolution is lower, the basic edge information has not been lost.

Fig. 18.24 shows the effect of applying pooling to the image shown on the left. No doubt, the edges 
become coarser, yet the information related to edges can still be extracted. Note that after pooling, the 
size of the image array is reduced. 

Looking at pooling from a different view, it can be said that it summarizes the statistics within the 
pooling area. Pooling can be considered as a special type of filtering, where instead of convolution, the 
maximum (or average) value is selected. It turns out that pooling helps the representation to become 
approximately invariant to small translations of the input. This can be understood by the following
simple argument. If a small translation does not bring into the window a new largest element and also
does not remove the largest element by taking it outside the pooling window, then the maximum does
not change.
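
The argument can be illustrated with a toy example. The sketch assumes a hypothetical 4 x 4 array with a single nonzero element and 2 x 2 max pooling with stride 2; the element is then moved by one and by two pixels to the right.

```python
import numpy as np

def max_pool(X, window=2, s=2):
    """Max pooling over a square array, no padding."""
    k = (X.shape[0] - window) // s + 1
    return np.array([[X[i * s:i * s + window, j * s:j * s + window].max()
                      for j in range(k)] for i in range(k)])

X = np.zeros((4, 4))
X[0, 0] = 5.0                        # single large element
X_small = np.roll(X, 1, axis=1)      # moved one pixel right: still in the same pooling window
X_large = np.roll(X, 2, axis=1)      # moved two pixels right: now in the next pooling window

print(np.array_equal(max_pool(X), max_pool(X_small)))   # True
print(np.array_equal(max_pool(X), max_pool(X_large)))   # False
```

The one-pixel shift leaves the pooled output unchanged because the largest element stays within its window, whereas the two-pixel shift moves it into a different window; this is why the invariance is only approximate.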

18.12.2 CONVOLUTION OVER VOLUMES 

In Fig. 18.19, the output of the first hidden layer comprises three image arrays. These will constitute 
the input to the next layer. Such an input setting that consists of multiple images is also the case when 
the input image is in color and its representation is given in terms of an RGB representation; that is, the 
input consists of three arrays, one per color. Another example is that of hyperspectral imaging, where 
the number of images is equal to that of the spectral bands (Section 13.13). Thus, in general, the inputs 
to the various layers are not two-dimensional arrays but sets of two-dimensional arrays. In mathematics,
such entities are known as multilinear arrays, or three-dimensional arrays, or three-dimensional
tensors, or volumes. We will adhere to the last term, because it is reminiscent of the associated geometry,
in the same way that we think of an image as a two-dimensional square. The question now is
to see how one can perform convolutions when volumes are involved. 




FIGURE 18.25 

A number of d matrices each of size h x w are stacked together to form a volume of size h x w x d. In this case, 
h = w = 5 and d = 3. 


By convention, the three dimensions of a volume will be represented as h for the height, w for the 
width, and d for the depth. Note that the depth d corresponds to the number of images involved. So, 
if we have three 256 x 256 images, then h = w = 256 and d = 3 and we will say that the volume is 
of size (dimension) 256 x 256 x 3. Fig. 18.25 illustrates the geometry associated with the respective 
definitions. 

Let the input to a layer be an h x w x d volume. When volumes are involved, hidden layers consist
of filter/kernel volumes, too. However, there is a crucial point here. The filter volume associated
with the hidden layer must be of the same depth as the input volume. The height and width dimensions
can be (and in practice usually are) different. We are going to use bold capital letters to denote
volumes. Assume that the input is the l x l x d volume I. Obviously, this comprises d images, say,
I_r, r = 1, 2, ..., d, each one of dimensions l x l. Let the filter be the m x m x d volume H. The latter
comprises the set of d images, H_r, r = 1, 2, ..., d, each one of dimensions m x m. Then the operation
of convolution is defined via the following steps:


1) Convolve corresponding two-dimensional image arrays to generate d two-dimensional output arrays,
i.e., O_r = I_r * H_r, r = 1, 2, ..., d.
FIGURE 18.26

The figure illustrates the case of depth d = 3. Convolving the l x l x d input volume (I) with a filter volume (H)
with dimensions m x m x d is equivalent to convolving the d (l x l) matrices which comprise the input volume
with an equal number of d (m x m) filters. This operation results in d output (feature map) matrices, which are subsequently
added together. The value of k is determined by the specific values of l, m, and the stride s. No padding is
involved in the case of this figure.


2) The convolution of the two volumes, I and H, is defined as

$$O = \sum_{r=1}^{d} O_r.$$


In words, the convolution (denoted by *) of two volumes is a two-dimensional array, i.e., 


3D volume * 3D volume = 2D array. 


The operation is illustrated in Fig. 18.26. Corresponding arrays (shown by different colors and types 
of lines) are convolved. The three (d = 3) output arrays are subsequently added together to form the 
convolution of the two volumes. The dimension k of the output depends on the values of l and m, the 
stride s, and the padding p, if used, according to Eq. (18.65). 

In practice, each layer of a convolutional network comprises a number of such filter volumes. For 
example, if the input to a layer is an l x l x d volume, and there are, say, c kernel volumes, each one 
of dimensions m x m x d, the output of the layer will be a k x k x c volume, where k is determined 
as explained before. 
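
The two steps above, together with the stacking of c filter volumes, can be sketched as follows. This is only an illustration under stated assumptions: the volumes are stored as NumPy arrays with the depth as the last axis, the height and width are equal, and the function names, sizes, and random values are hypothetical.

```python
import numpy as np

def conv_volume(I, H, p=0, s=1):
    """Convolve an l x l x d volume with an m x m x d filter volume; the result is a k x k array."""
    l, _, d = I.shape
    m = H.shape[0]
    assert H.shape[2] == d, "the filter depth must equal the input depth"
    I_pad = np.pad(I, ((p, p), (p, p), (0, 0)), mode="constant")  # pad height and width only
    k = (l + 2 * p - m) // s + 1                                   # Eq. (18.65)
    O = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # summing over the whole m x m x d window equals adding the d per-slice results
            O[i, j] = np.sum(H * I_pad[i * s:i * s + m, j * s:j * s + m, :])
    return O

def conv_layer(I, filters, p=0, s=1):
    """Apply c filter volumes and stack the c outputs into a k x k x c volume."""
    return np.stack([conv_volume(I, H, p, s) for H in filters], axis=-1)

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5, 3))                           # l = 5, d = 3
filters = [rng.standard_normal((3, 3, 3)) for _ in range(4)] # c = 4 filter volumes, m = 3

O = conv_layer(I, filters)
print(O.shape)   # (3, 3, 4)

# Steps 1) and 2) carried out explicitly for the first filter volume
O_steps = sum(conv_volume(I[:, :, r:r + 1], filters[0][:, :, r:r + 1]) for r in range(3))
print(np.allclose(O[:, :, 0], O_steps))   # True
```

The last check confirms that summing over the full window in one go gives the same result as convolving slice by slice and adding the d outputs.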

Network in Network and 1 x 1 Convolution 

The 1 x 1 convolution [139] does not make sense when two-dimensional arrays are involved. Indeed,
a 1 x 1 filter matrix is a scalar. Convolving an l x l matrix I with a scalar a is equivalent to sliding
the scalar value over all the pixels and multiplying each one of them with a. The result is the trivial
multiplication aI. However, when volumes are involved, the 1 x 1 convolution makes sense. In this
case, the corresponding filter, H, is a volume of size 1 x 1 x d. Geometrically, this is a “tube,” with
h = w = 1 and d elements in depth, h(1, 1, r), r = 1, 2, ..., d. Hence, the output of convolving an
l x l x d volume I with a 1 x 1 x d volume H is the weighted average

$$O = I * H = \sum_{r=1}^{d} h(1, 1, r)\, I_r,$$

where I_r, r = 1, 2, ..., d, are the d arrays, each of dimensions l x l, that comprise I. Now, one may
wonder why we need such an operation in practice. The answer is related to the size of the involved 
volumes; via the use of 1 x 1 convolutions, one can control and change their sizes to fit the needs of 
the network. 



FIGURE 18.27 

Illustration of the 1 x 1 convolution. There are c 1 x 1 x d tubes, where d is the depth of the input volume. The
convolution of each one of the tubes with the input volume results in a two-dimensional array of the same dimensions
as the input ones (5 x 5). The elements of the output arrays are weighted averages of the corresponding
elements of the input arrays. For example, the red elements, at position (3, 3) of the 5 x 5 output arrays, are the
weighted averages of the red points at the (3, 3) position of the input arrays. The weights used in the weighted average
are the respective elements that define the corresponding tube. Since we have c tubes, the output volume will be
of depth c.

Let us assume that at one stage/layer of a deep network we have obtained a volume I of dimensions
k x k x d. To change the depth from d to c, while retaining the same size k for the height and the
width, we employ c volumes, H_t, t = 1, 2, ..., c, each of dimensions 1 x 1 x d. Performing the c
convolutions, we obtain

$$O_t = I * H_t = \sum_{r=1}^{d} h_t(1, 1, r)\, I_r, \quad t = 1, 2, \ldots, c. \tag{18.66}$$

Stacking O_t, t = 1, 2, ..., c, together, we obtain the volume O of dimension k x k x c. The operation
is illustrated in Fig. 18.27.

For example, we can employ 1 x 1 convolution to reduce the size, by selecting c < d. In this way,
although we reduce the depth, the elements of the new volume are weighted averages of the elements of the original
volume I. So, the original information is still retained within the new volume, in an averaging fashion. Often,
once the new volume O is obtained, its elements are “pushed” through a nonlinearity, e.g., ReLU.
Sometimes, the 1 x 1 convolution followed by the nonlinearity is referred to as the network in network
operation and its purpose is to add an extra nonlinearity stage in the flow of operations through the
network. Thus, in this context, if c < d, the network in network operation can be thought of as a
nonlinear dimensionality reduction technique.

Looking at the 1 x 1 convolution from a different viewpoint, it is nothing but a layer of a fully
connected neural network, with c nodes. The weights connecting the tth node with the d input values
are h_t(1, 1, r), r = 1, 2, ..., d. This is the reason that we also call the operation network in network, in
the sense that we embed a fully connected neural network between two successive convolution layers
in the network. In this way, one can add an extra nonlinearity in the flow of operations through the
network. In practice, the parameters of the respective H_t tubes are computed during the training phase
of the overall network.
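
The equivalence between the 1 x 1 convolution of Eq. (18.66) and a fully connected layer applied at each pixel location can be verified numerically. In the sketch below, the sizes, the random values, and the array W (whose entry W[t, r] plays the role of h_t(1, 1, r)) are hypothetical, and the depth is stored as the last axis.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, c = 5, 3, 2
I = rng.standard_normal((k, k, d))     # input volume, k x k x d
W = rng.standard_normal((c, d))        # W[t, r] stands for h_t(1, 1, r)

# Eq. (18.66): O_t is a weighted sum of the d slices I_r
O = np.stack([sum(W[t, r] * I[:, :, r] for r in range(d)) for t in range(c)],
             axis=-1)

# Fully connected view: each pixel's d-dimensional vector is fed to c nodes
O_fc = (I.reshape(-1, d) @ W.T).reshape(k, k, c)

print(O.shape)                 # (5, 5, 2)
print(np.allclose(O, O_fc))    # True

# Network in network: push the result through a nonlinearity, e.g., ReLU
A = np.maximum(O, 0.0)
```

The height and width are left untouched, while the depth changes from d to c; the same c x d weights are reused at every pixel location.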

Example 18.4. Let us consider the input to a layer to be a 28 x 28 x 192 volume, I. The goal is
to produce at the output of the layer a volume, O, which is of dimension 28 x 28 x 32. To this end,
employ 5 x 5 same convolutions. Compute the number of required operations.

a) The direct way: Since same convolutions are required, we should first pad all the arrays that are
stacked in I with p zero columns and rows, where p = (5 - 1)/2 = 2 (as has been explained in the text
before, when “same convolutions” were defined). After padding, the height and width of each image
become h = w = 32. Using s = 1 for the stride, the total number of possible locations, as we slide the
5 x 5 window over a 32 x 32 image array, is 28^2 = 784. For each location of the window, we need
to perform 5^2 = 25 multiplications and additions (MADS), or 784 x 25 = 19600 MADS operations
per image. Since the volume involves 192 images, the total number of MADS operations needed is
19600 x 192 ≈ 3.76 x 10^6. These operations are needed for each channel. Since the depth of the output
should be 32, we need 32 such filter volumes (channels) and the total number of required MADS
operations will be 3.76 x 10^6 x 32 ≈ 120 million.

b) Via the use of 1 x 1 convolutions: We will now produce a 28 x 28 x 32 output volume using a
substantially lower number of operations. Our path will involve two stages. We will first employ 1 x 1
convolutions to generate an intermediate volume, O', of dimensions 28 x 28 x 16. To this end, we use
16 filter volumes of size 1 x 1 x 192 and perform the respective convolutions, according to Eq. (18.66). The
corresponding number of MADS is 28^2 x 192 x 16 ≈ 2.4 million. Next, we pad each one of the
arrays contained in volume O' with p = 2 extra zero columns and rows (as before); then, we perform
same convolution with 32 filter volumes H_t, t = 1, 2, ..., 32, each one of dimensions 5 x 5 x 16, to
obtain the output volume, O, of dimensions 28 x 28 x 32. The respective number of operations is
28^2 x 25 x 16 x 32 ≈ 10 million. Thus, now, the total number of operations, for both stages, amounts
to approximately 12.4 million, which is substantially lower than the 120 million that were needed via
the previous path (a).

Often, the intermediate volume, O', is known as the bottleneck layer; its role is to “shrink” the 
size of the input volume first, prior to obtaining the final output one. The overall layout is shown in 
Fig. 18.28. 
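
The operation counts of Example 18.4 can be reproduced with a few lines of code. The helper name is hypothetical; it simply multiplies the number of window positions, the window area, the input depth, and the number of filter volumes, which is the counting rule used in the example.

```python
def conv_mads(k, m, d, c):
    """MADS for c filter volumes of size m x m x d producing a k x k x c output."""
    return k * k * m * m * d * c

# Path (a): 5 x 5 same convolutions applied directly to the 28 x 28 x 192 input
direct = conv_mads(k=28, m=5, d=192, c=32)

# Path (b): 1 x 1 bottleneck down to depth 16, then 5 x 5 same convolutions
stage1 = conv_mads(k=28, m=1, d=192, c=16)
stage2 = conv_mads(k=28, m=5, d=16, c=32)
bottleneck = stage1 + stage2

print(direct)       # 120422400
print(stage1)       # 2408448
print(stage2)       # 10035200
print(bottleneck)   # 12443648
```

The exact counts agree with the rounded figures quoted in the example (about 120 million versus about 12.4 million), that is, a saving close to a factor of ten.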

FIGURE 18.28

The bottleneck layer. In the first step, c = 16 1 x 1 convolutions are performed and the output is a 28 x 28 x 16
volume. Then, we apply 5 x 5 same convolutions with 32 filter volumes to obtain the final 28 x 28 x 32 volume.

18.12.3 THE FULL CNN ARCHITECTURE

The typical form of a full convolutional network consists of a sequence of convolutional layers, each
comprising the three basic steps, namely, convolution, nonlinearity, and pooling, as described in the
beginning of Section 18.12.1. Depending on the application, one can stack as many layers as required,
where the output of one layer becomes the input to the next one. Inputs and outputs to each layer are
volumes, as described before. The general architecture is illustrated in Fig. 18.29. In the first layer,
a number of filter volumes (channels) is employed to perform convolutions followed by the ReLU
(usually) nonlinear operation. Then, the pooling stage takes over to reduce the height and width of each
output volume, which is then used as input to the second layer, and so on. Finally, the output volume
of the last layer is vectorized. Sometimes, this is also referred to as the flattening operation. In words, all
the elements of the output volume are stacked one under the other to form a vector. Vectorization can
take place via various strategies. As a matter of fact, the obtained vector forms the feature vector that
has finally been generated via the various transformations that the convolutions implement layer after
layer. This feature vector will then be used as input to a learner, for example, to a fully connected neural
network (lower part of the figure) or to any other predictor, such as a kernel machine.
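
To see how the sizes of the volumes evolve through such a stack, one can simply apply Eq. (18.65) layer after layer. The sketch below only tracks shapes and performs no actual filtering; the two-layer configuration (filter sizes, numbers of filters, pooling windows) is a hypothetical choice made for illustration, and pooling is assumed to use a stride equal to its window size.

```python
def conv_out(l, m, p=0, s=1):
    """Eq. (18.65): output height/width for an l x l input and an m x m window."""
    return (l + 2 * p - m) // s + 1

def trace_shapes(l, d, layers):
    """Return the volume shape after each layer and the length of the flattened feature vector."""
    shapes = [(l, l, d)]
    for layer in layers:
        # convolution step (the nonlinearity does not change the shape)
        l = conv_out(l, layer["m"], layer.get("p", 0), layer.get("s", 1))
        # pooling step, with stride equal to the pooling window
        l = conv_out(l, layer["pool"], 0, layer["pool"])
        d = layer["c"]            # the depth equals the number of filter volumes
        shapes.append((l, l, d))
    return shapes, l * l * d

# Hypothetical configuration: two layers with 5 x 5 filters and 2 x 2 pooling
layers = [{"m": 5, "c": 6, "pool": 2},
          {"m": 5, "c": 16, "pool": 2}]
shapes, n_features = trace_shapes(32, 1, layers)
print(shapes)       # [(32, 32, 1), (14, 14, 6), (5, 5, 16)]
print(n_features)   # 400
```

In this example the height and width shrink from 32 to 5 while the depth grows from 1 to 16, and the flattened feature vector that would be passed to the learner has 400 elements.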

The general strategy is to keep reducing the height and width while increasing the depth of the 
volumes. Larger depth corresponds to more filters per stage, which translates to more features. Both 
the number of convolutional layers and the number of layers of the fully connected network depend 
heavily on the application and, up to now, there is no formal method to determine automatically 
the number of layers as well as the number of filters or nodes per layer. The choice is a matter of 
“engineering” and evaluation of different combinations to select the best. A good practice is to select 
an existing architecture that has been used before in the related application and start from there. A more 
recent research path towards developing more systematic ways of learning the number of nodes/filters 
per layer is via Bayesian learning arguments (see, e.g., [174]). 

Training of a convolutional network follows a similar rationale as that of backpropagation, which 
has been discussed in Section 18.4. However, certain modifications should be involved in order to take 
care of the constraints imposed by the weight sharing (see, e.g., [59] for details). 

As a final remark, recall that the crucial point for training deep networks is, besides the available 
computational power, the existence of large data sets (e.g., [200,262]), which have made the training 
of such big networks possible. 

FIGURE 18.29

The full CNN comprises a series of layers. In the figure, two such layers are shown. Each layer consists of the
convolution step, followed by the application of the nonlinearity, and then the pooling step, which reduces the
height and width of the images that comprise the respective volume. The depth remains the same. As we move to
higher layers, the tendency is to reduce the height and width and increase the depth of the volumes. The final output
volume is vectorized and presented as input to a learner, usually a fully connected network.

What Deep Neural Networks Learn

It is by now well established that convolutional networks do work well and at the time this edition
is compiled, they constitute the state of the art in a large number of diverse applications. However, a
critical question is: what type of features does a CNN learn? In other words, what is the information that
is propagated from one layer to the other? This is crucial to the understanding of why they work so well.
The answer to this question could facilitate their interpretability, which is of paramount importance in
specific applications, such as in medical and financial fields. Also, such understanding could help to
develop improved models.

To this end, visualization techniques have been mobilized to reveal the specific input stimuli that 
excite individual feature maps at any layer (e.g., [261]). In [261], an experimental study has been 
conducted in the context of computer vision. The findings reveal a hierarchical nature of the features 
produced, as one moves from the input to the final output. For example, layer #2 seems to respond to 
the corners and other edge/color conjunctions associated with the objects present in the input image. 
Layer #3 has more complex invariances, capturing similar textures. Higher layers reveal more class-specific
information, e.g., dog faces, bird legs, etc. Such findings are in line with the discussion in
Section 18.11.1. 

Another interesting finding is that convergence in the lower layers, closer to the input, is rather 
fast. In contrast, the upper layers, closer to the output, develop after a considerable number of epochs. 
Furthermore, concerning feature invariance, with respect to translation and scaling, it seems that small 
transformations may have a dramatic effect in the first layers but a smaller impact on the top layers. 

With the growing success of deep neural networks, being able to explain their predictions
is of paramount importance in building up confidence for their deployment in real-world applications.
To this end, there are still a number of open questions and the topic is currently an ongoing research
area. A more detailed coverage is beyond the scope of this chapter and the interested reader may obtain
a good feeling of the current trends from, e.g., [27] and the references therein.
More recently (e.g., [56] and the references therein), the importance of texture in recognizing objects
in images has been outlined. It is postulated that local textures can provide sufficient information
about object classes and object recognition could, in principle, be achieved through texture recognition
alone. This seems to be in line with the results obtained in [22], where it is demonstrated that using
small local patches, rather than integrating object parts for shape recognition, can achieve surprisingly
high accuracies. Such findings may also facilitate the interpretability issue that has been previously
discussed, since one can follow more easily how the evidence from smaller image patches is integrated to
reach the final decision.

In a different direction, research has been focused on revealing various aspects associated with the 
multilayer structure of such networks in an effort to shed light from different viewpoints, which can 
help in their understanding. For example, in [24,156] deep architectures are implemented via a cascade 
of wavelet transform convolutions, combined with a nonlinear operation followed by an averaging 
operation, so as to build translation-invariant representations. Furthermore, such networks preserve 
high-frequency information related to classification. In [25], it is shown that the pooling step in deep 
CNNs results in shift invariance. In [57], it is shown that deep neural networks with random Gaussian
weights perform a distance preserving embedding of the data. In this analysis, tools from the
compressed sensing and dictionary learning tasks are employed, which establishes a bridge of deep 
learning with the topics treated in Chapters 9 and 19. A closely related dictionary learning approach is 
also followed in [185]. A multilayer convolutional sparse coding scheme is adopted and the similarity 
with deep convolutional networks is established. The ReLU nonlinearity is seen as a special type of a 
soft thresholding operation (Chapter 9, [48]). 

In [204,233], information theoretic arguments are mobilized and deep networks are viewed as a 
succession of intermediate representations in a Markov chain; this is closely related to the successive 
refinement of information in rate distortion theory. Each layer in the network can now be characterized 
by the amount of information it retains from the input variables, on the target output variables, and 
on the predicted outputs of the network. In [10], a bridge between deep networks and approximation 
theory via spline functions is established. It is shown that a large class of deep networks can be written 
as a composition of max-affine spline operators. 

Finally, another path is the one that establishes bridges between Gaussian processes and deep networks.
This field is not new, and its origins can be traced back to the early 1990s [168]. Since then, it is
well known that a single-layer fully connected neural network with an i.i.d. prior over its parameters is 
equivalent to a Gaussian process, in the limit of infinite network width. In more recent years, general- 
izations to more layers have been established and the topic seems to regain popularity (see, e.g., [136], 
[7], [30] and the references therein). 

18.12.4 CNNS: THE EPILOGUE 

What we have described in the last subsection are the basic steps that are used to design a CNN. There 
are a number of variants around the architecture given in Fig. 18.29. Also, there are different tricks and 
algorithms that can be used to perform computations, e.g., for the efficient computation of the involved 
convolutions. Undoubtedly, there is a lot of “engineering” involved in making such big networks learn
the parameters and run efficiently in practical applications. Below we provide a brief description of
some classical convolutional networks. The reader who wants to become familiar with CNNs and gain a
deeper understanding of them is advised to read the related papers. Although some of the implementation
tricks adopted there may no longer be in use today, these papers can still offer further insight
into CNNs.

LeNet-5 : This is a typical example of the first generation of CNNs and it was built to recognize
digits (see, e.g., [132]). For historical reasons, let us comment a bit on its architecture. The input of
the network consists of grayscale images of size 32 x 32 x 1. The network employs two convolution
layers. In the first layer, the output volume has size 28 x 28 x 6, which after pooling becomes 14 x
14 x 6. The dimensions of the volume in the second layer are 10 x 10 x 16 and after pooling 5 x 5 x
16. The nonlinearity used at the time was of the sigmoid type. Observe that the height and the width of
the volumes decrease and the depth increases, as pointed out before. The number of elements of the
last volume is equal to 400. These elements are stacked into a vector and feed the corresponding input
nodes of a fully connected network. The latter consists of two hidden layers with 120 nodes in the first
and 84 nodes in the second. There are 10 output nodes, one per digit, using a softmax nonlinearity. The
total number of the involved parameters is of the order of 60 thousand.
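
The volume sizes and the parameter count quoted above can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions that are not stated in the text: 5 x 5 filters with stride 1 and no padding, 2 x 2 pooling with stride 2, one bias per filter and per node, and full connectivity between the input and output channels of the second convolution layer (the original network used a sparser connection scheme, so the count is only indicative).

```python
def conv_out(n, k, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window of size k."""
    return (n + 2 * pad - k) // stride + 1

def conv_params(k, c_in, c_out):
    """Weights plus one bias per filter."""
    return (k * k * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Weights plus one bias per node."""
    return (n_in + 1) * n_out

size, depth = 32, 1          # 32 x 32 x 1 input
shapes, total = [], 0

for c_out in (6, 16):        # two convolution layers, each followed by pooling
    total += conv_params(5, depth, c_out)
    size, depth = conv_out(size, 5), c_out
    shapes.append((size, size, depth))
    size = conv_out(size, 2, stride=2)
    shapes.append((size, size, depth))

flat = size * size * depth   # elements fed to the fully connected part
n_in = flat
for n_out in (120, 84, 10):
    total += dense_params(n_in, n_out)
    n_in = n_out

print(shapes, flat, total)
```

Under these assumptions the total comes out slightly above 60 thousand, in line with the order of magnitude given in the text.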

AlexNet : This network is also a historical one since it demonstrated that the crucial point for making
big networks work is the availability of large training sets [126]. The related paper is the one that
really brought CNNs back onto the scene and acted as a catalyst for their adoption far beyond the
digit recognition task. AlexNet is a development of LeNet-5, yet it is much bigger and involves
approximately 60 million parameters. The inputs to the network are RGB images of size 227 x 227 x
3. It comprises five hidden layers and the final volume consists of 9216 elements that feed a fully
connected network with two hidden layers of 4096 units each. The output consists of 1000 softmax
nodes (one per class) to recognize images from the ImageNet data set for object recognition [200]. The
ReLU has been used as the nonlinearity in the hidden layers.

VGG-16: This network [216] is much larger than AlexNet. It involves a total of approximately 140
million parameters. The main characteristic of this network is its regularity. It involves 3 x 3 filters to
perform same convolutions using padding and stride s = 1 and 2 x 2 windows for max pooling with
stride s = 2. Every time pooling is involved, the height and the width of the volumes are halved and
the depth is increased by a factor of two, until it reaches 512. Starting with 224 x 224 x 3 input images
and after 13 layers, the final volume has size 7 x 7 x 512, a total of 25,088 elements, which after its
vectorization is fed to a fully connected network with 2 hidden layers, each comprising 4096 nodes. The
1000 output nodes are built around the softmax nonlinearity and ReLU has been used for the hidden units
throughout the network.
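
The regularity of VGG-16 makes its bookkeeping easy to trace. The sketch below assumes the stage layout of the original VGG paper (five stages with 2, 2, 3, 3, and 3 convolution layers and depths 64, 128, 256, 512, and 512), which is not spelled out in the text above, and one bias per filter and per node.

```python
def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out   # weights plus one bias per filter

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out           # weights plus one bias per node

# (number of 3 x 3 "same" convolution layers, depth) per stage: an assumption
stages = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size, depth, n_conv, total = 224, 3, 0, 0
for n_layers, c_out in stages:
    for _ in range(n_layers):
        total += conv_params(3, depth, c_out)   # same convolution keeps size
        depth = c_out
        n_conv += 1
    size //= 2                                  # 2 x 2 max pooling, stride 2

flat = size * size * depth
n_in = flat
for n_out in (4096, 4096, 1000):
    total += dense_params(n_in, n_out)
    n_in = n_out

print(n_conv, (size, size, depth), flat, total)
```

The final volume is 7 x 7 x 512, and the total is close to the 140 million parameters mentioned in the text, with the first fully connected layer accounting for most of it.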

GoogleNet and the Inception network : The architecture used in this network deviates from the “archetypal”
one given in Fig. 18.29. At the heart of this network lies the so-called inception module [226].
An inception module consists of filters of different sizes and depths as well as a different pooling path.
A typical architecture of an inception module that provides its rationale is shown in Fig. 18.30. Note
that the output volume of the previous layer becomes the input to different paths. One involves a 1 x 1
convolution that acts on the depth of the input volume. Another path performs pooling and then feeds
a 5 x 5 convolution. Two paths feed separately two different convolution stages, one based on a 3 x 3
filter and the other on a 5 x 5. Prior to the convolutions, a bottleneck layer, via 1 x 1 convolution, is
employed to reduce the respective computational load (Example 18.4). The output volumes of all these
paths are then concatenated to form the final output of this stage. The idea behind the inception module
is to let the network, during the training phase, “decide” which operations fit best for the different
layers and inputs. For example, as we have already commented, features of higher abstraction are
captured by layers closer to the output of the network. Hence, the spatial concentration of the respective
information is expected to decrease. This suggests that the ratio of 3 x 3 and 5 x 5 convolutions should
increase as we move to higher layers. The number of layers of the network was 22 and the total number
of parameters reported in the paper was of the order of 6 million.

FIGURE 18.30

The inception module concept. Each layer comprises different paths of convolutions. The intermediate outputs from
all the paths are concatenated together, across the channel dimension, to build up the final output volume of the
corresponding layer.
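
The saving offered by a 1 x 1 bottleneck in front of a large filter can be illustrated by counting multiplications. All sizes in the sketch below (a 28 x 28 x 192 input volume, 32 output channels, a bottleneck depth of 16, and the per-path depths) are hypothetical values chosen for illustration and are not taken from the text.

```python
def conv_mults(height, width, c_in, c_out, k):
    """Multiplications of a 'same' k x k convolution with stride 1."""
    return height * width * c_out * k * k * c_in

HEIGHT, WIDTH, C_IN = 28, 28, 192   # hypothetical input volume
C_OUT, C_MID = 32, 16               # hypothetical output and bottleneck depths

# 5 x 5 convolution applied directly on the input volume
direct = conv_mults(HEIGHT, WIDTH, C_IN, C_OUT, 5)

# 1 x 1 bottleneck first, then the 5 x 5 convolution on the thinner volume
bottleneck = (conv_mults(HEIGHT, WIDTH, C_IN, C_MID, 1)
              + conv_mults(HEIGHT, WIDTH, C_MID, C_OUT, 5))

# Concatenation across channels: the output depth is the sum of path depths.
path_depths = [64, 128, 32, 32]     # hypothetical depths of the four paths
out_depth = sum(path_depths)

print(direct, bottleneck, direct / bottleneck, out_depth)
```

With these numbers the bottleneck reduces the multiplications by almost a factor of ten, while the spatial size of the output is unchanged.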

Residual networks (ResNets): The benefits of designing deep networks have already been discussed.
We also addressed ways to cope with the problem of vanishing/exploding gradients by a combination
of methods and tricks that enable the backpropagation algorithm to converge sufficiently fast.
However, once we start building very deep networks (of the order of tens or even hundreds of layers)
we are confronted with the following “unorthodox” behavior.

One would expect that, as more and more layers are added, the training error improves or at least does
not increase. However, what one observes in practice is that beyond a certain number of layers, the training
error starts increasing. This is graphically illustrated in Fig. 18.31. This phenomenon has nothing to
do with overfitting. After all, we are talking about the training error and not the generalization one. It
seems that this may be due to the optimization task, which becomes harder and harder as more and more
layers are added. Mathematically, any layer, say, the $r$th layer, can be seen as a mapping that maps the
corresponding input, e.g., $y^{r-1}$, to the output, e.g., $y^r$. Let us denote the respective mapping as

$$y^r = H(y^{r-1}).$$






























FIGURE 18.31 

When a network is very deep, it turns out that the training error starts increasing when the number of layers exceeds
some number, instead of decreasing (red curve) as expected from theory.


Taking this view, it is easy to see why, when adding more layers, the training error should not
increase. In the worst case, where all information has been extracted up to some layer, we expect that
an extra layer, say, the $r$th one, should implement the identity mapping, i.e., $y^r = H(y^{r-1}) = y^{r-1}$. That is,
the extra layer adds no information and simply copies the input to the output. However, it seems that
once the network starts becoming very deep, accuracy gets “saturated” and the optimization tools have
difficulty coming up with an accurate enough solution to this identity mapping, at least within a
feasible time.

One way to bypass this difficulty is proposed in [76]. The idea is to fit an alternative equivalent
mapping, i.e.,

$$F(y^{r-1}) := H(y^{r-1}) - y^{r-1}.$$

Then the original mapping, $H(y^{r-1})$, becomes equal to $F(y^{r-1}) + y^{r-1}$. In practice, it turns out that
optimizing with respect to the residual mapping, $F$, is easier than optimizing with respect to the original
one, $H$. In the extreme case, when an identity mapping should be realized, it seems that it is easier to
push the residual to zero than to fit the identity mapping.

The use of residual representation is not new and has been used before in the context of vector 
quantization. The essence of the residual learning is to introduce the so-called residual building block, 
shown in Fig. 18.32. In this way, a number of layers, say, two, as in the case of the figure, are stacked 
together, and we explicitly let these layers fit the residual mapping, via the so-called shortcut or skip 
connections. Each weight layer performs a transformation over its input, e.g., convolutions. If $y^r$ and
$y^{r-1}$ are of different dimension, then the identity mapping shortcut is modified to $W y^{r-1}$, where $W$ is
a matrix of appropriate dimensions.
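
A residual building block with two weight layers can be sketched as follows. This is only an illustration under simplifying assumptions: dense (matrix) layers stand in for the convolutions, the ReLU is used between the two layers, and no nonlinearity is applied after the addition. The function and variable names are chosen here and do not come from [76].

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_mapping(y_prev, W1, b1, W2, b2):
    """F(y^{r-1}): two stacked weight layers (dense here for simplicity)."""
    return W2 @ relu(W1 @ y_prev + b1) + b2

def residual_block(y_prev, W1, b1, W2, b2, W_skip=None):
    """H(y^{r-1}) = F(y^{r-1}) + shortcut, with an optional projection W."""
    shortcut = y_prev if W_skip is None else W_skip @ y_prev
    return residual_mapping(y_prev, W1, b1, W2, b2) + shortcut

rng = np.random.default_rng(0)
d = 5
y_prev = rng.normal(size=d)

# With the residual pushed to zero, the block realizes the identity mapping.
zero_W, zero_b = np.zeros((d, d)), np.zeros(d)
identity_out = residual_block(y_prev, zero_W, zero_b, zero_W, zero_b)

# Different input/output dimensions: the shortcut becomes W y^{r-1}.
d_out = 3
projected = residual_block(
    y_prev,
    rng.normal(size=(d, d)), np.zeros(d),
    rng.normal(size=(d_out, d)), np.zeros(d_out),
    W_skip=rng.normal(size=(d_out, d)),
)
```

The first call shows why the identity is easy to reach in this parameterization: all weights equal to zero suffice, whereas a plain layer would have to learn the identity through its nonlinear transformation.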

Fig. 18.33A shows a schematic example of a so-called plain network (no residuals involved) that 
comprises a sequence of convolutional layers. Its residual counterpart is shown in Fig. 18.33B. Note 
that if the identity shortcut is used, then no extra parameters are involved and training follows the
standard backpropagation rationale. A variant of the residual network, known as highway network, has
been proposed (e.g., [69,211]), where extra data dependent gating functions that control the flow in the
shortcuts have been introduced.

FIGURE 18.32

Two successive layers in a network have been combined and their implied combined transformation is performed 
via the identity mapping, implemented by the red line shortcut/skip connection. 





FIGURE 18.33 

(A) The layout of a plain network and (B) the corresponding layout with the use of shortcuts. 


In [76], networks as deep as 50 to 152 layers have been constructed. In spite of such sizes, it is 
reported that the overall number of arithmetic operations remains substantially lower than that required 
by VGG-16, at an improved performance in error rates. In [229], the concept of residual networks is 
combined with that of the inception networks. It is reported that training with residual connections can 
accelerate the training of inception networks significantly. 

DenseNet: As explained before, ResNets consist of a sequence of connected residual blocks. In this
way, one layer accepts input not only from its direct predecessor layer, but also from a previous one,
via the shortcut (skip) connection, and these two are added together. In contrast, in the DenseNet [100],
the building blocks are the so-called dense blocks, which combine a number of layers together, and
every layer within the block receives inputs from all the previous ones within the block. Furthermore,
these inputs are not added but are concatenated together. The reported results indicate that in this way
one can reduce the number of operations and parameters involved without sacrificing performance.
Numbers of layers as high as 250 were tried.
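
The difference between adding and concatenating can be made concrete with a toy dense block. The sketch below is an illustration only: dense layers with a ReLU replace the convolutions, and the names `d0` (input dimension), `growth` (outputs per layer), and `n_layers` are hypothetical choices, not values from [100].

```python
import numpy as np

def dense_block(x, weights):
    """Each layer sees the concatenation of the block input and all
    previous layer outputs; outputs are concatenated, not added."""
    features = [x]
    for W in weights:
        stacked = np.concatenate(features)
        features.append(np.maximum(W @ stacked, 0.0))
    return np.concatenate(features)

rng = np.random.default_rng(0)
d0, growth, n_layers = 8, 4, 3

# Layer l receives d0 + l * growth inputs and produces `growth` outputs.
weights = [rng.normal(size=(growth, d0 + l * growth)) for l in range(n_layers)]
x = rng.normal(size=d0)
out = dense_block(x, weights)
```

Note that the input of the block is carried unchanged to its output, and that the dimension grows linearly with the number of layers in the block.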


18.13 RECURRENT NEURAL NETWORKS 

Recall from the previous section that at the heart of convolutional networks lies the concept of weight 
sharing. That is, the same filter matrix is sliding over an image array instead of dedicating a specific 
weight for each image pixel. In this way, a neural network can easily scale to images of different 
dimensions. 

Our interest in the current section turns to the case of sequential data. That is, the input vectors are
not independent but occur in sequence. Moreover, the specific order in which they occur encapsulates
important information. For example, such sequences occur in speech recognition and in language
processing, e.g., machine translation. Undoubtedly, the sequence in which words occur is of paramount
importance. Dynamic graphical models, such as Kalman filtering and hidden Markov models (HMMs),
which have been treated in Chapters 16 and 17, are models that deal with sequential data.

Weight sharing via convolutions can also be used, and has been used, for such cases (see, e.g., [128]).
Such networks are known as time-delay neural networks. However, sliding a filter across time to
perform convolutions is an operation of a local nature. The output is a function of the input samples within
a time window spanned by the length of the respective impulse response of the filter, which for practical
reasons cannot be very long.

To bypass the aforementioned drawback of limited memory, in the current section we focus on
networks that build upon the concept of the state. As was the case with the HMMs and Kalman filtering,
the state vector “encodes” the past history up to the current time $n$. The idea behind recurrent neural
networks (RNNs) is to apply the same type of operations (weight sharing) at each time instant (which
justifies the term recurrent) by involving the current state (previous history) as well as the value of the
current input. In this way, a network can scale well to sequences of different lengths; this is because
one does not assign specific weights at the different instants and the same weights are shared across
the whole time axis.

The variables that are involved in an RNN are:

• the state vector at time $n$, denoted as $h_n$. The symbol reminds us that $h$ is a vector of hidden variables
(hidden layer in the neural network jargon); the state vector constitutes the memory of the system,

• the input vector at time $n$, denoted as $x_n$,

• the output vector at time $n$, $\hat{y}_n$, and the target output vector, $y_n$.

The model is described via a set of unknown parameter matrices and vectors, namely, $U$, $W$, $V$, $b$,
and $c$, which have to be learned during training, in analogy to the unknown parameters in an HMM,
which are also learned during training.

The equations that describe an RNN model are

$$h_n = f(U x_n + W h_{n-1} + b), \tag{18.67}$$

$$\hat{y}_n = g(V h_n + c), \tag{18.68}$$


where the nonlinear functions $f$ and $g$ act element-wise and are applied individually on each element
of their vector arguments. In words, once a new input vector has been observed, the state vector is
updated. Its new value depends on the most recent information, which is conveyed by the input $x_n$, as
well as the past history, as this has been accumulated in $h_{n-1}$. The output depends on the updated state
vector, $h_n$. That is, it depends on the “history” up to the current time instant $n$, as this is expressed
by $h_n$.

Note that Eqs. (18.67) and (18.68) are very similar to the extended Kalman filter defined in Chapter 4.
Note, however, that in the present case the involved matrices and vectors, $U$, $W$, $V$, $b$, and $c$,
are unknown and have to be learned. Such a view of the RNNs, in the context of extended Kalman
filtering, has been adopted in [32]. Typical choices for $f$ are the hyperbolic tangent, tanh, or the ReLU
nonlinearities. The initial value $h_0$ is typically set equal to the zero vector. The output nonlinearity, $g$,
is often chosen to be the softmax function, introduced in Eq. (18.45).



FIGURE 18.34

(A) The input, $x$, “feeds” the (hidden) state, $h$, which is updated using also its previous value (self-loop in the
graph). In turn, it generates the output, $\hat{y}$. (B) The operations involved in an RNN as time evolves are shown,
starting from an initial value $h_0$ of $h$. As input vectors are sequentially observed, the corresponding output vectors are
produced and the updated state vectors are passed to the next stage (time instant). The process goes on, until the
final output vector is computed, for an input sequence of length $N$.


From the above equations, it is clear that the parameter matrices and vectors are shared across
all the time instants. During training, they are initialized via random numbers. The graphical model
associated with the pair of Eqs. (18.67) and (18.68) is given in Fig. 18.34A. In Fig. 18.34B, the graph
is unfolded over the various time instants for which observations are available. For example, if the
sequence of interest is a sentence of 10 words, then $N$ is set equal to 10, while $x_n$ is the vector that
codes the respective input words.
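
The recursion in Eqs. (18.67) and (18.68) translates almost line by line into code. The following is a minimal sketch, assuming $f$ = tanh, $g$ = softmax, and a zero initial state; the dimensions and the random initialization scale are arbitrary choices made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c):
    """Run the RNN over a sequence xs of shape (N, input_dim)."""
    h = np.zeros(W.shape[0])                 # h_0 = 0
    states, outputs = [], []
    for x in xs:                             # same parameters at every n
        h = np.tanh(U @ x + W @ h + b)       # Eq. (18.67)
        states.append(h)
        outputs.append(softmax(V @ h + c))   # Eq. (18.68)
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(0)
d_x, d_h, d_y = 3, 4, 5
U = 0.5 * rng.normal(size=(d_h, d_x))
W = 0.5 * rng.normal(size=(d_h, d_h))
V = 0.5 * rng.normal(size=(d_y, d_h))
b, c = np.zeros(d_h), np.zeros(d_y)

# The same parameters handle sequences of different lengths N.
states_10, outputs_10 = rnn_forward(rng.normal(size=(10, d_x)), U, W, V, b, c)
states_4, outputs_4 = rnn_forward(rng.normal(size=(4, d_x)), U, W, V, b, c)
```

Because no parameter is tied to a specific time instant, nothing in the code depends on the sequence length.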

18.13.1 BACKPROPAGATION THROUGH TIME

Training RNNs follows a similar rationale as that of the backpropagation algorithm for training
feed-forward neural networks, as has already been discussed in Section 18.4.1. After all, an RNN can be
seen as a feed-forward network with $N$ layers. The top layer is that at time instant $N$ and the first layer
corresponds to time $n = 1$. A difference lies in that the hidden layers in an RNN also produce outputs, i.e.,
$\hat{y}_n$, and are fed directly with inputs. However, as far as the training is concerned, these differences do
not affect the main rationale.

Learning the unknown parameter matrices and vectors is achieved via a gradient descent scheme,
in line with Eqs. (18.10) and (18.11). It turns out that the required gradients of the cost function, with
respect to the unknown parameters, are computed recursively, by starting at the latest time instant, $N$, and
going backwards in time, $n = N - 1, N - 2, \ldots, 1$. This is the reason that the algorithm is known as
backpropagation through time (BPTT).

The cost function is the sum over time, $n$, of the corresponding loss function contributions, which
depend on the respective values of $h_n$, $x_n$, i.e.,

$$J(U, W, V, b, c) = \sum_{n=1}^{N} J_n(U, W, V, b, c).$$

For example, for the cross-entropy loss function case,

$$J_n(U, W, V, b, c) := -\sum_{k} y_{nk} \ln \hat{y}_{nk},$$

where the summation is over the dimensionality of $y$, and

$$\hat{y}_n = g(h_n, V, c) \quad \text{and} \quad h_n = f(x_n, h_{n-1}, U, W, b).$$

It turns out that at the heart of the computation of the gradients of the cost function with respect to
the various parameter matrices and vectors lies the computation of the gradients of $J$ with respect to
the state vectors, $h_n$. Once the latter have been computed, the rest of the gradients, with respect to
the unknown parameter matrices and vectors, are straightforward to obtain. To this end, note that each
$h_n$, $n = 1, 2, \ldots, N - 1$, affects $J$ in two ways:

• directly, through $J_n$,

• indirectly, via the chain that is imposed by the RNN structure, i.e.,

$$h_n \rightarrow h_{n+1} \rightarrow \cdots \rightarrow h_N.$$

That is, $h_n$, besides $J_n$, also affects all the subsequent cost values, $J_{n+1}, \ldots, J_N$.

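Employing the chain rule for derivatives, the above dependencies lead to the following recursive
computation:

$$\frac{\partial J}{\partial h_n} = \underbrace{\left(\frac{\partial h_{n+1}}{\partial h_n}\right)^T \frac{\partial J}{\partial h_{n+1}}}_{\text{indirect recursive part}} + \underbrace{\left(\frac{\partial \hat{y}_n}{\partial h_n}\right)^T \frac{\partial J}{\partial \hat{y}_n}}_{\text{direct part}}, \tag{18.69}$$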
where, by definition, the derivative of a vector, say, $y$, with respect to another vector, say, $x$, is defined
as the matrix $\left[\frac{\partial y}{\partial x}\right]_{ij} := \frac{\partial y_i}{\partial x_j}$. Note that the gradient of the cost function, with respect to the hidden
parameters (state vector) at layer $n$, is given as a function of the respective gradient in the layer
above, i.e., with respect to the state vector at time $n + 1$. The full proof of the backpropagation through
time is given in Problem 18.12.

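The two passes required by the backpropagation through time are summarized below.

• Forward pass:

- Starting at $n = 1$ and using the current estimates of the involved parameter matrices and vectors,
compute in sequence,

$$(h_1, \hat{y}_1) \rightarrow (h_2, \hat{y}_2) \rightarrow \cdots \rightarrow (h_N, \hat{y}_N).$$

• Backward pass:

- Starting at $n = N$, compute in sequence,

$$\frac{\partial J}{\partial h_N} \rightarrow \frac{\partial J}{\partial h_{N-1}} \rightarrow \cdots \rightarrow \frac{\partial J}{\partial h_1}.$$

Note that the computation of the gradient $\frac{\partial J}{\partial h_N}$ is straightforward and it only involves the direct part in
Eq. (18.69).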

For the implementation of the BPTT, one proceeds by (a) randomly initializing the involved
unknown matrices and vectors, (b) computing all the required gradients, following the previously stated
two passes, and (c) performing the updates according to the gradient descent scheme. Steps (b) and (c)
are performed in an iterative manner until a convergence criterion is met, in analogy to the standard
backpropagation Algorithm 18.2.
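
The two passes can be sketched in NumPy as follows. The sketch assumes $f$ = tanh, $g$ = softmax, the cross-entropy loss, and target vectors whose elements sum to one, in which case the gradient of $J_n$ with respect to the softmax argument is $\hat{y}_n - y_n$. The function names are chosen here for illustration, and the analytic gradients are compared against finite differences as a sanity check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(params, xs, ys):
    """Forward pass: cost J, states [h_0, ..., h_N], outputs [y_hat_1, ...]."""
    U, W, V, b, c = params
    h = np.zeros(W.shape[0])                    # h_0 = 0
    hs, y_hats, J = [h], [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(U @ x + W @ h + b)          # Eq. (18.67)
        y_hat = softmax(V @ h + c)              # Eq. (18.68)
        J -= np.sum(y * np.log(y_hat))          # cross-entropy term J_n
        hs.append(h)
        y_hats.append(y_hat)
    return J, hs, y_hats

def bptt(params, xs, ys):
    """Backward pass: the recursion of Eq. (18.69), from n = N down to 1."""
    U, W, V, b, c = params
    J, hs, y_hats = forward(params, xs, ys)
    dU, dW, dV, db, dc = (np.zeros_like(p) for p in params)
    da_next = np.zeros(W.shape[0])              # no indirect part at n = N
    for n in range(len(xs), 0, -1):
        dz = y_hats[n - 1] - ys[n - 1]          # dJ_n / d(V h_n + c)
        dh = W.T @ da_next + V.T @ dz           # indirect part + direct part
        da = (1.0 - hs[n] ** 2) * dh            # through the tanh nonlinearity
        dU += np.outer(da, xs[n - 1])
        dW += np.outer(da, hs[n - 1])
        db += da
        dV += np.outer(dz, hs[n])
        dc += dz
        da_next = da
    return J, [dU, dW, dV, db, dc]

def numerical_grads(params, xs, ys, eps=1e-5):
    """Central finite differences, used only to check bptt()."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            J_plus = forward(params, xs, ys)[0]
            p[idx] = old - eps
            J_minus = forward(params, xs, ys)[0]
            p[idx] = old
            g[idx] = (J_plus - J_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
d_x, d_h, d_y, N = 3, 4, 2, 6
shapes = [(d_h, d_x), (d_h, d_h), (d_y, d_h), (d_h,), (d_y,)]
params = [0.5 * rng.normal(size=s) for s in shapes]   # step (a)
xs = rng.normal(size=(N, d_x))
ys = np.eye(d_y)[rng.integers(0, d_y, size=N)]        # one-hot targets

J, grads = bptt(params, xs, ys)                       # step (b)
num = numerical_grads(params, xs, ys)
max_err = max(np.max(np.abs(g - gn)) for g, gn in zip(grads, num))
```

Step (c) would then subtract a multiple of each gradient from the corresponding parameter; the step size and the stopping rule are left out of the sketch.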

Vanishing and Exploding Gradients

The problem of vanishing and exploding gradients has been introduced and discussed in Section 18.6, in
the context of the backpropagation algorithm. The same problems are present in the BPTT algorithm.
After all, the latter is a specific form of the backpropagation concept, and, as said, an RNN can be seen
as a multilayer network, where each time instant corresponds to a different layer. As a matter of fact,
in RNNs, the vanishing/exploding gradient phenomenon appears in a rather “aggressive” way, taking
into account that $N$ can take large values.

The multiplicative nature of the propagation of gradients can be readily spotted in Eq. (18.69). To
help the reader grasp the main concept, let us simplify the setting and assume that only one state
variable is involved. Then the state vectors become scalars, $h_n$, and the matrix $W$ a scalar $w$. Furthermore,
assume the outputs to be scalars, too. Then the recursion in Eq. (18.69) is simplified as

$$\frac{\partial J}{\partial h_n} = \frac{\partial h_{n+1}}{\partial h_n} \frac{\partial J}{\partial h_{n+1}} + \frac{\partial \hat{y}_n}{\partial h_n} \frac{\partial J}{\partial \hat{y}_n}. \tag{18.70}$$

Assuming in Eq. (18.67) $f$ to be the standard tanh function and using its respective derivative from
known tables,$^5$ it is readily seen that

$$\frac{\partial h_{n+1}}{\partial h_n} = w(1 - h_{n+1}^2),$$

where, by the definition of the tanh function, the magnitude of $h_{n+1}$ is smaller than 1. Writing the
recursion for two successive steps, we get

$$\frac{\partial J}{\partial h_n} = w^2 (1 - h_{n+1}^2)(1 - h_{n+2}^2) \frac{\partial J}{\partial h_{n+2}} + \text{other terms}.$$

It is not difficult to see that the multiplication of the terms smaller than one can lead to vanishing values,
especially if we take into account that, in practice, sequences can be quite long, e.g., $N = 100$. Hence,
for time instants close to $n = 1$, the contribution to the gradient of the first term on the right-hand side
in Eq. (18.70) will involve a large number of products of numbers less than one in magnitude. On the
other hand, the value of $w$ enters the product as a power, $w^n$. So, if its value is larger than one, it can
lead to exploding values of the respective gradients (see, e.g., [179]).

In a number of cases, one can truncate the backpropagation algorithm to a few time steps. Another 
way is to replace the tanh nonlinearity with the ReLU one. For the exploding value case, one can 
introduce a clipping technique that clips the values to a predetermined threshold, once values become 
larger than that. 
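
Both effects, as well as clipping, can be observed in the scalar setting used above. The sketch below accumulates the product of the factors $w(1 - h_{n+1}^2)$ along a sequence. Two choices are assumptions made for illustration: for the exploding case the inputs are set to zero, so that the state stays at the origin, where the derivative of tanh equals one and the product reduces to a pure power of $w$; and clipping is implemented by rescaling the gradient to a maximum norm, which is one common variant of the idea.

```python
import numpy as np

def backprop_factor(w, xs, u=1.0):
    """Product of dh_{n+1}/dh_n = w (1 - h_{n+1}^2) over a scalar tanh RNN,
    i.e., the factor linking dJ/dh_N back to dJ/dh_1."""
    h, factor = 0.0, 1.0
    for n, x in enumerate(xs):
        h = np.tanh(u * x + w * h)
        if n > 0:
            factor *= w * (1.0 - h ** 2)
    return factor

def clip_by_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

rng = np.random.default_rng(0)
N = 100

# |w| < 1: every factor is below one in magnitude, so the product vanishes.
vanishing = backprop_factor(0.9, rng.normal(size=N))

# w > 1 with the state kept at zero: the product equals w**(N - 1).
exploding = backprop_factor(1.5, np.zeros(N))

clipped = clip_by_norm(np.array([exploding, 1.0]), threshold=5.0)
print(vanishing, exploding, clipped)
```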

However, another technique that is usually employed in practice is to replace the previously
described standard RNN formulation with an alternative structure, which can cope better with such
phenomena that are caused by the long-term dependencies.


Remarks 18.6. 


• Deep RNNs: Besides the basic RNN network that comprises a single layer of states, extensions have
been proposed that involve multiple layers of states, one above the other (see, e.g., [180]).

• Bidirectional RNNs: As the name suggests, in the bidirectional RNNs, there are two state variables,
i.e., one denoted as $\overrightarrow{h}$, which propagates forward, and another one, $\overleftarrow{h}$, which propagates backwards.
In this way, the outputs are left to depend on both the past and the future (see, e.g., [65]).


The Long Short-Term Memory (LSTM) Network 

The key idea behind the LSTM network, proposed in the seminal paper [94], is the so-called cell state, 
which helps to overcome the problems associated with the vanishing/exploding phenomena that are 
caused by the long-term dependencies within the network. 

The LSTM networks have the built-in ability to control the information flow into and out of the
system's memory via nonlinear elements known as gates. These gates are implemented via the logistic
sigmoid nonlinearity and a multiplier. From an algorithmic point of view, the gates are equivalent to
applying a weighting on the related information flow. The weights lie in the [0,1] range and depend on
the values of the involved variables that activate the sigmoid nonlinearity. In other words, the weighting
(control) of information takes place in context. According to such a rationale, the network has the agility
to forget information that has already been used and is no longer needed. The basic LSTM cell/unit is
shown in Fig. 18.35. It is built around two sets of variables, stacked in the vector $s$, which is known
as the cell or unit state, and the vector $h$, which is known as the hidden variables vector. An LSTM
network is built upon the successive concatenation of this basic unit. The unit corresponding to time
$n$, besides the input vector, $x_n$, receives $s_{n-1}$ and $h_{n-1}$ from the previous stage and passes $s_n$ and $h_n$
to the next one.

$^5$ Recall, $\frac{d \tanh(x)}{dx} = 1 - \tanh^2(x)$.

FIGURE 18.35

The LSTM unit. Note that in this case, there are two types of memory-related variables that are propagated, i.e., the
cell state vector, $s$, and the hidden variables vector, $h$. The involved bias vectors are not shown to unclutter notation.

The associated updating equations are summarized below,

$$f = \sigma(U^f x_n + W^f h_{n-1} + b^f),$$

$$i = \sigma(U^i x_n + W^i h_{n-1} + b^i),$$

$$\tilde{s} = \tanh(U^s x_n + W^s h_{n-1} + b^s),$$

$$o = \sigma(U^o x_n + W^o h_{n-1} + b^o),$$

$$s_n = s_{n-1} \circ f + i \circ \tilde{s},$$

$$h_n = o \circ \tanh(s_n),$$

where $\circ$ denotes element-wise product between vectors or matrices (Hadamard product), that is,
$(s \circ f)_i = s_i f_i$, and $\sigma$ denotes the logistic sigmoid function.

Observe that the cell state, $s$, passes direct information from the previous instant to the next one.
This information is first controlled by the first gate, according to the elements in $f$, which take values
in the range [0,1], depending on the current input and the hidden variables that are received from the
previous stage. This is what we said before, i.e., that the weighting is adjusted in “context.” Next, new
information, i.e., $\tilde{s}$, is added to $s_{n-1}$, which is also controlled by the second sigmoid gate network
(i.e., $i$). Thus, there is a guarantee that information from the past is forwarded to the future in a direct
way, which helps the network to memorize information. It turns out that this type of memory exploits
the long-range dependencies in the data better than the basic RNN structure does. The hidden
variables vector $h$ is controlled by both the cell state and the current values of the input and the previous
state variables. All the involved matrices and vectors are learned via the training phase. Note that there
are two lines associated with $h_n$. The one to the right leads to the next stage and the one on the top is
used to provide the output, $\hat{y}_n$, at time $n$, via, say, the softmax nonlinearity, as in the standard RNNs in
Eq. (18.68).
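
One step of the LSTM unit can be sketched directly from the updating equations. The code below is an illustration under the assumption that the parameters are stored in a dictionary with keys such as `"Uf"`, `"Wf"`, and `"bf"` (a naming convention introduced here); the dimensions and the initialization scale are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d_x, d_h, rng):
    """Random matrices and zero biases for the four branches f, i, s, o."""
    p = {}
    for gate in "fiso":
        p["U" + gate] = 0.1 * rng.normal(size=(d_h, d_x))
        p["W" + gate] = 0.1 * rng.normal(size=(d_h, d_h))
        p["b" + gate] = np.zeros(d_h)
    return p

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM update: returns (h_n, s_n)."""
    def pre(gate):
        return p["U" + gate] @ x + p["W" + gate] @ h_prev + p["b" + gate]
    f = sigmoid(pre("f"))             # weighting of the previous cell state
    i = sigmoid(pre("i"))             # weighting of the new information
    s_tilde = np.tanh(pre("s"))       # new candidate information
    o = sigmoid(pre("o"))             # weighting of the emitted hidden vector
    s = s_prev * f + i * s_tilde      # element-wise (Hadamard) products
    h = o * np.tanh(s)
    return h, s

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
p = init_params(d_x, d_h, rng)

h, s = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_x)):   # a sequence of six input vectors
    h, s = lstm_step(x, h, s, p)

# Gates saturated so that f is close to 1 and i close to 0:
# the cell state is copied almost unchanged to the next stage.
p_keep = dict(p, bf=np.full(d_h, 50.0), bi=np.full(d_h, -50.0))
s_prev = rng.normal(size=d_h)
_, s_kept = lstm_step(rng.normal(size=d_x), h, s_prev, p_keep)
```

The last lines illustrate the direct path through the cell state: when the first gate is open and the second one closed, past information is passed on untouched.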

Besides the previously discussed LSTM structure, a number of variants have been proposed. An
extensive comparative study among different LSTM and RNN architectures can be found in, e.g.,
[68,113].

RNNs and LSTMs have been successfully used in a wide range of applications, such as language
modeling (e.g., [222]), machine translation (e.g., [147]), speech recognition (e.g., [66]), machine vision
for the generation of image descriptors (e.g., [115]), and in fMRI data analysis in order to grasp the
time dynamics in the associated brain networks (e.g., [207]). For example, in language processing,
the input is typically a sequence of words, which are encoded as numbers (these are pointers to the
available dictionary). The output is the sequence of words to be predicted. During training, one sets
$y_n = x_{n+1}$. That is, the network is trained as a nonlinear predictor.
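
The construction of the training pairs for such a predictor is simple enough to show in a few lines. The sentence and the way the dictionary is built below are made up for illustration.

```python
sentence = "the cat sat on the mat".split()

# Dictionary: every distinct word is mapped to an integer pointer.
vocab = {word: k for k, word in enumerate(sorted(set(sentence)))}
ids = [vocab[word] for word in sentence]

# Next-word prediction: the target at time n is the input at time n + 1.
inputs, targets = ids[:-1], ids[1:]

for x_n, y_n in zip(inputs, targets):
    print(x_n, "->", y_n)
```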

18.13.2 ATTENTION AND MEMORY 

The use of attention schemes in neural networks has a rather long history (see, e.g., [41]). As the name
suggests, the concept of “attention” draws on the idea of the attention mechanism found in humans. 
For example, our vision system provides us with the ability to focus more on the most important 
information that resides in a scene; this information is in context; that is, it depends on what we have 
in mind to look at. In machine learning, a number of different models have been proposed on how one 
can implement the attention concept. 

One of the most popular paths is to apply a type of weighting (transformation) on various variables
on which the output depends. These weights are learned during training. Let us take as an example an
RNN. In its basic form, as has been discussed before, the output, $\hat{y}_n$, depends on the corresponding
state vector, $h_n$. However, although the state vector encodes/summarizes the system's memory up to
the most recent time instant, $n$, this may not necessarily be the most important information that is
needed in certain tasks. As a matter of fact, it is rather unreasonable to assume that in a long input
sequence in a machine language translation system, for example, the most recent state vector is the
most representative information to get a reliable output. A typical case to demonstrate the previous
statement is when one translates from Japanese to English, where the last word of a Japanese sentence
could be highly predictive of the first word in its English translation. Similarly in life, what action
we decide to take in a specific situation heavily depends on our total previous experience. Yet, some
specific experiences in the past may have a much stronger influence than the most recent ones.

To deal with such cases, one can employ an attention mechanism so that the output, at time $n$,
depends on a weighted combination of all the state vectors up to time $n$, and leave it to the system
to learn the values of the weights during the learning phase. So, during training, it will be decided what
is the most important piece of information on which the output should be based. That is, the system
learns to “attend” to the most important contextual information.

For example, the output vector may be modified to depend on all the previously computed state
vectors, i.e.,

$$\hat{y}_n = f\left(\sum_{i=1}^{n} a_{ni} h_i + c\right),$$

where $a_{ni}$ are the corresponding weights at time $n$. The above idea of combining all previous state
vectors has been employed, in a somewhat different formulation, in the machine translation system
that is described in [8] (see also Section 18.18).
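
The weighted combination of state vectors can be sketched as a matrix product with a lower triangular weight matrix. How the weights are produced is not specified above; the sketch below uses a softmax over dot-product scores between state vectors, restricted to $i \le n$, which is a hypothetical choice made only to have concrete weights and is not the formulation of [8]. The nonlinearity $f$ = tanh is an assumption as well.

```python
import numpy as np

def causal_attention_weights(H):
    """Hypothetical weights a_{ni}: softmax over the scores h_n . h_i,
    with a_{ni} = 0 for i > n. H has shape (N, d), one state per row."""
    N = H.shape[0]
    scores = H @ H.T
    allowed = np.tril(np.ones((N, N), dtype=bool))      # i <= n only
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # stability shift
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)

def attended_outputs(H, A, c, f=np.tanh):
    """Row n equals f(sum_{i <= n} a_{ni} h_i + c)."""
    return f(A @ H + c)

rng = np.random.default_rng(0)
N, d = 6, 4
H = rng.normal(size=(N, d))      # state vectors h_1, ..., h_N
c = np.zeros(d)

A = causal_attention_weights(H)
Y = attended_outputs(H, A, c)
```

Row $n$ of `A` shows how strongly the output at time $n$ relies on each earlier state; displaying such a matrix as an image gives the kind of grayscale map used to visualize attention weights.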

FIGURE 18.36

The values of the attention weights in grayscale, showing the degree of dependence of the words in the output
sequence (English) to those of the input sequence (French) (from [8]). Note, for example, that in order to produce the
word “Syria,” the network “attends” the words “La Syrie.”

Fig. 18.36, taken from the previous reference, illustrates the rationale of employing a weighting
mechanism. The French words form the input and the English words the corresponding output
sequence. The values of the corresponding attention weights are visualized as pixels; the larger the
weight, the whiter the pixel. Note, for example, that the output word “produce” is the result of weighting
information from three successive time instants, associated with the words “peut plus produire,”
and the word “destruction” from two words, i.e., “la destruction.”
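The weighted combination of state vectors can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the weights are obtained here by a softmax over hypothetical relevance scores (in a trained system these scores would be produced by learned parameters), the nonlinearity f is taken to be tanh, and the function names are our own.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to positive weights that sum to one."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(states, scores, c, f=math.tanh):
    """Compute y_n = f(sum_i a_ni * h_i + c) over the states h_1, ..., h_n."""
    weights = softmax(scores)             # a_n1, ..., a_nn (softmax is an assumption)
    dim = len(states[0])
    context = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return [f(context[d] + c[d]) for d in range(dim)], weights

# Three past state vectors; the first one receives the largest score.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]                  # hypothetical values standing in for learned scores
y, a = attention_output(states, scores, c=[0.0, 0.0])
```

The list `a` plays the role of one row of the grayscale map in Fig. 18.36: one weight per input position, for a single output time instant.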

In [253], a network with attention mechanism is used for the automatic generation of image de- 
scription. In the described system, a variant of a CNN network is employed to generate a set of feature 
vectors. Each feature vector corresponds to a portion of the original image. This step is equivalent to 
an encoding phase of the original image. These vectors are in turn used to form the input sequence to 
an LSTM network, where attention weights have been employed. The network is trained to compute 
the output word probability given the LSTM state. Fig. 18.37, taken from [253], shows the part of the 
image that the model “attends” while generating a word. 

FIGURE 18.37

The original image is shown at the top left. The resulting image description is “A person is standing on the beach
with a surfboard.” The attention weights highlight the part of the image on which the output
word depends most heavily. As an example, observe that while the output produces the word “surfboard,” the
attention weights that are associated with the pixels within the corresponding region in the image get the largest
values (from [253]).

An interesting aspect of integrating an attention mechanism within the model is that one can follow
what the model does and how the output information is formed; this can be useful when the issue of
interpretability of the network becomes important, that is, when one needs to understand “why” and “how” the network
makes a decision. For a related discussion, see, e.g., [206].

Different paths to attention are also possible. In [67], the so-called neural Turing machine (NTM) 
is proposed, where a memory module is applied in parallel with the neural network (feed-forward or 
LSTM). A learnable attention mechanism is used to read and write to the memory selectively. An 
extension of the NTM that involves reinforcement learning techniques is given in [259]. In [221,251], 
a memory generated by the input data is allowed to be read multiple times before producing an output. 
This is in analogy to making multiple reasoning steps based on the content of the memory; that is, 
based on the input “story.” 

Remarks 18.7. 

• Reservoir computing: The term reservoir computing refers mainly to two closely related fami- 
lies of recurrent networks that have been independently proposed, i.e., the echo state networks 
(ESNs) [106] and the liquid state machines [149]. The latter implements spiking neurons instead of
continuous-valued neurons.

The main idea behind the original ESN is to train only the parameters associated with the output
neurons. The weights associated with the input and the state vectors are generated randomly, following
certain rules. The untrained part is called the reservoir and the resulting states echoes. The
rationale springs forth from the idea that if a random RNN possesses certain properties, training the 
output parameters suffices. The main property that the reservoir should possess is the so-called echo 
state property. This is basically a stability condition related to the dynamics of the network [144]. 
A nonparametric Bayesian formulation is discussed in [29]; a prior distribution is imposed over the 
output weights, which are in turn marginalized out in the context of prediction generation, given the 
training data. 
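The ESN rationale can be sketched as follows. This is a rough illustration and not the formulation of [106]: the reservoir weights are rescaled so that every absolute row sum is below one, which, for tanh units, is a simple sufficient condition for the state update to be a contraction; the readout is fitted with plain LMS updates, chosen here only for brevity. All names and numerical values are assumptions made for the example.

```python
import math
import random

def make_reservoir(n_state, n_in, rho=0.9, seed=0):
    """Draw random input and recurrent weights; they are never trained."""
    rng = random.Random(seed)
    W_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_state)]
    W = [[rng.uniform(-1, 1) for _ in range(n_state)] for _ in range(n_state)]
    # Rescale so that every absolute row sum is at most rho < 1. With tanh units
    # the update becomes a contraction in the max norm, which is a sufficient
    # (not necessary) condition for the echo state property.
    max_row = max(sum(abs(w) for w in row) for row in W)
    W = [[rho * w / max_row for w in row] for row in W]
    return W_in, W

def step(W_in, W, h, x):
    """One reservoir update: h_n = tanh(W h_{n-1} + W_in x_n)."""
    return [math.tanh(sum(wi * hi for wi, hi in zip(W[k], h)) +
                      sum(ui * xi for ui, xi in zip(W_in[k], x)))
            for k in range(len(h))]

def run(W_in, W, inputs, h0):
    """Collect the reservoir states (the "echoes") produced by an input sequence."""
    h, states = list(h0), []
    for x in inputs:
        h = step(W_in, W, h, x)
        states.append(h)
    return states

def train_readout(states, targets, mu=0.05, epochs=50):
    """Fit only the linear output weights, here with plain LMS updates."""
    v = [0.0] * len(states[0])
    for _ in range(epochs):
        for h, y in zip(states, targets):
            err = y - sum(vi * hi for vi, hi in zip(v, h))
            v = [vi + mu * err * hi for vi, hi in zip(v, h)]
    return v

W_in, W = make_reservoir(n_state=20, n_in=1)
inputs = [[math.sin(0.2 * n)] for n in range(200)]
# The same input sequence, started from two different initial states.
states_a = run(W_in, W, inputs, [0.5] * 20)
states_b = run(W_in, W, inputs, [-0.5] * 20)
gap = max(abs(a - b) for a, b in zip(states_a[-1], states_b[-1]))
# Train the readout to predict the next input sample.
targets = [math.sin(0.2 * (n + 1)) for n in range(200)]
v = train_readout(states_a, targets)
```

The quantity `gap` shows the stability condition at work: the influence of the initial state fades away and the state ends up depending only on the input history.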


18.14 ADVERSARIAL EXAMPLES 

At the moment this edition is compiled, deep neural networks are state of the art in achieving
performance and accuracies that are often comparable to, and sometimes better than, those achieved by
humans. However, it seems that we are not yet in a position to claim that these models truly
“understand” the task they have “learned” to perform; this is in spite of the fact that, for example, they can
predict correct labels in classification tasks with very high probability. In [227], it is demonstrated that
one can construct adversarial examples that consistently fool machine learning models. The term
“adversarial” means that one can intentionally impose small worst-case perturbations on patterns in the
input set, which will result in wrong label prediction with high probability. The most interesting issue
is that this small noise perturbation is hardly perceptible to the human eye, in the case of images
(e.g., [227]), and to the human ear, in the case of music (e.g., [118,220]). Fig. 18.38 (taken from [227])
shows nine images in total. A neural network (AlexNet) has been trained to recognize the content of
images. All three images on the left, taken from the respective test set, were recognized correctly. The
images in the middle are noise images that are added to the corresponding ones on the left. The
resulting images are shown on the right. No human has any difficulty in predicting the correct label. Yet,
AlexNet classified all three images to the class “ostrich, Struthio camelus”!

There are various ways to generate adversarial examples. In [227], an optimization task is employed 
that finds the minimum perturbation that can lead to a change of a label. In [63], the perturbation is 
performed in the direction of the sign of the gradient of the cost function with respect to the input 
pattern. In [164], a method is proposed to construct a universal small perturbation that can cause all
images in a data set to be misclassified with high probability. 

FIGURE 18.38

The images on the left have been classified correctly. All the images on the right have been classified as “ostrich,
Struthio camelus”! The images in the center show the noise (after some magnification) added to obtain the images
on the right (example taken from [227]).

It seems that at the heart of this “strange” behavior lies the high dimensionality of the input space.
In general, one expects that in a learning task, the smoothness assumption is valid. That is, for small
enough positive $\epsilon$ and an input pattern $x$, we would expect that for any $v$ with $\|v\| \le \epsilon$ the pattern
$x' := x + v$ is assigned to the same class as $x$, with high probability. The effect of the high dimensionality
on the smoothness condition can easily be seen for the case of a linear classifier (see, e.g., [63]). Let
the trained classifier be described in terms of its parameters, $\theta$. Given an input pattern, $x$, the label is
computed according to the sign of the inner product, $\theta^T x$. For the case of $x'$, the inner product is given
by $\theta^T x' = \theta^T x + \theta^T v$. Let us now intentionally set $v = \pm\epsilon\,\mathrm{sgn}(\theta)$, where the sign operation acts in an
element-wise fashion. Then it turns out that

$$\theta^T x' = \theta^T x + \theta^T v = \theta^T x \pm \epsilon \sum_{i=1}^{l} |\theta_i|.$$

Hence, if the input dimensionality $l$ is large, large deviations are expected between the values of the
respective inner products and this can lead to different predicted labels for $x$ and $x'$. In words, the
combination of linearity with high dimensionality violates the smoothness assumption.
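The growth of the deviation with the dimensionality can be checked numerically. The following sketch draws a hypothetical parameter vector with standard Gaussian entries (an assumption made only for illustration) and computes the shift that the perturbation $v = \epsilon\,\mathrm{sgn}(\theta)$ causes in the inner product, for a fixed $\epsilon$ and increasing $l$.

```python
import random

def sign(t):
    """Element-wise sign: -1, 0, or +1."""
    return (t > 0) - (t < 0)

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def adversarial_shift(theta, eps):
    """Return theta^T v for v = eps * sgn(theta), i.e., eps * sum_i |theta_i|, and v itself."""
    v = [eps * sign(t) for t in theta]
    return inner(theta, v), v

rng = random.Random(0)
eps = 0.01            # every input component is changed by at most eps
shifts = {}
for l in (10, 1000, 100000):
    theta = [rng.gauss(0.0, 1.0) for _ in range(l)]   # hypothetical classifier
    shifts[l], _ = adversarial_shift(theta, eps)
```

Although no single component of the input moves by more than `eps`, the values in `shifts` grow roughly in proportion to $l$, so for large $l$ the shift can easily exceed a typical value of $|\theta^T x|$ and flip the sign.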

The above explanation can be extended to deep networks when, for example, ReLU is employed
or when the involved nonlinearities operate in their linear region. An alternative geometric viewpoint in
explaining the adversarial phenomenon is provided in [49,164]. There, it is pointed out that at the heart
of the adversarial examples lie some distinct geometric properties that are associated with the imposed
perturbation in relation to geometric correlations between different parts of the decision boundary.
Their analysis demonstrates the difference that exists between a general type of random noise and a
worst-case adversarial type of perturbation.

The question that comes to mind is whether a more careful sampling of the input space would result
in a richer representation, where adversarial examples could be included in the training set and hence
the network can learn them. As claimed in [227], the set of adversarial negatives is of extremely low
probability, and thus is never (or rarely) observed in the data set, yet it is dense (much like the rational
numbers); so, it can be found near virtually every test case. It should be noted, however, that such a
statement is not substantiated by a theoretical proof.

ADVERSARIAL TRAINING 

Having constructed adversarial examples and made some attempts to understand their existence, the
next front is focused more on the practical aspects of the phenomenon. Although it seems that
adversarial examples are far from common in the input (training and test sets) data, it is a rather disturbing
phenomenon. Moreover, it can always be used to fool a network intentionally. To this end, a number
of techniques have already appeared in an attempt to “robustify” the networks against adversaries.

In [227], adversarial examples were generated and fed back to the training set. This is a type of 
regularization via data generation. However, in [118], it is argued that in the case of music data the 
method did not really improve performance. 

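In [63], the loss function J is modified appropriately as

$$J'(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x + \Delta x, y), \quad 0 < \alpha < 1,$$

where

$$\Delta x = \epsilon\, \mathrm{sgn}\left(\nabla_x J(\theta, x, y)\right), \quad \epsilon > 0,$$

is shown to be a direction for adversarial perturbation.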
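As an illustration of the modified loss, the sketch below evaluates it for a single training pair. It assumes, purely for the sake of the example, that J is the logistic loss of a linear classifier with labels in {-1, +1}, for which the gradient with respect to the input is available in closed form; the numerical values are hypothetical.

```python
import math

def loss(theta, x, y):
    """Logistic loss J(theta, x, y) = ln(1 + exp(-y * theta^T x)), with y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    return math.log1p(math.exp(-margin))

def grad_x(theta, x, y):
    """Gradient of J with respect to the input x: -y / (1 + exp(margin)) * theta."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * t for t in theta]

def sign_gradient_perturbation(theta, x, y, eps):
    """Delta x = eps * sgn(grad_x J(theta, x, y))."""
    return [eps * ((g > 0) - (g < 0)) for g in grad_x(theta, x, y)]

def adversarial_loss(theta, x, y, eps, alpha=0.5):
    """J' = alpha * J(theta, x, y) + (1 - alpha) * J(theta, x + Delta x, y)."""
    dx = sign_gradient_perturbation(theta, x, y, eps)
    x_adv = [xi + di for xi, di in zip(x, dx)]
    return alpha * loss(theta, x, y) + (1.0 - alpha) * loss(theta, x_adv, y)

theta = [1.0, -2.0, 0.5]          # hypothetical parameter values
x, y, eps = [0.5, -0.25, 1.0], 1, 0.1
dx = sign_gradient_perturbation(theta, x, y, eps)
j_clean = loss(theta, x, y)
j_mixed = adversarial_loss(theta, x, y, eps)
```

In a training loop one would minimize the mixed loss with respect to `theta`, recomputing the perturbation for the current parameter values at every step.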

In [161], a regularizer is used to promote smoothness of the model distributions with respect to 
the input, around every input data point. In [209], a robust optimization method is proposed, which is 
built around a minmax formulation, where the cost function is optimized with respect to a worst-case 
realization of a perturbation. In [175], the distillation technique is proposed as a way to cope with
adversarial examples. Distillation is a training procedure initially designed to train a deep neural network
in the context of transfer learning (see Section 18.17), e.g., [92]. In the adversarial training framework, 
it is claimed that distillation can reduce the gradients that lead to adversarial sample creation by many 
orders of magnitude. Moreover, distillation can significantly increase the average minimum number of 
features that need to be modified to create adversarial samples. 

As a final remark, it must be said that at the time the current edition is being compiled,
adversarial examples constitute an ongoing hot topic of research. Adversarial examples have the potential to
be dangerous. Consider, for example, an attacker who targets autonomous vehicles by using stickers
or paint to create an adversarial “stop” sign that the vehicle would interpret as a “yield” or another
sign (see, e.g., [177]). In [127], it is demonstrated that when adversarial images obtained from a cell
phone camera are fed to an ImageNet Inception classifier, a large fraction of them are
misclassified, even when perceived through the camera. In [176], a threat model is proposed where attacks
and defenses within an adversarial framework are categorized. It is shown that there are (possibly
unavoidable) tensions between model complexity, accuracy, and resilience that must be calibrated for the
environments in which the models will be used.

In [208], it is argued that in spite of the wide range of defenses that have been proposed to robustify
neural networks against adversarial attacks, such defenses seem to be quickly broken. Moreover,
the theoretical analysis in the paper indicates that for certain classes of problems, adversarial examples
are inescapable.


18.15 DEEP GENERATIVE MODELS 

So far, our emphasis was on supervised learning via neural network architectures. The other main 
direction in machine learning is that of unsupervised learning, where one has to “learn” and unveil the 
underlying dependencies and structure that are hidden in the input data. Clustering, as introduced in 
Chapter 12 (see also [231] for an extensive coverage) is a major path of learning via unlabeled data. 
Another one is to learn explicit probabilistic dependencies with the final goal being that of estimating 
probability distributions, as was the focus in Chapters 15 and 16. 

Another direction, where the use of unlabeled samples is of importance, is that of learning an 
efficient representation of the input data. As a matter of fact, feature learning is a facet of data rep- 
resentation. The convolutional layers in a CNN, which precede the fully connected network part, are 
dedicated to this goal; that is, to obtain an efficient and information-rich representation of the input 
data. However, in the framework of CNNs, this takes place in the context of a specific learning task; 
that is, the output labels are also considered during training to reach an optimal representation that
serves the needs of the task at hand. 

In this section, the goal is to learn representations that are independent of a specific
target task, that is, to extract such information using the input data only. The reason for such a focus
is twofold. First, a learned model representation of the input data can be used subsequently in different
tasks in order to facilitate the training. Sometimes, this is also known as pretraining, where parameters
learned using unlabeled data can be used as initial estimates of the parameters for a subsequent supervised
learning task. This can be useful when the number of labeled examples is not large enough (see, e.g., [148]
for a discussion). It is worth pointing out that such a pretraining rationale is of historical importance,
because it led to the revival of neural networks, as will be discussed soon [86].

Another path, which is currently of high interest, is to exploit such learned representations in order 
to generate new data; recall the discussion in Section 18.7. Such techniques may not necessarily learn 
the underlying probability distribution explicitly, yet they acquire the knowledge needed to
be able to draw samples according to the distribution that “explains” the data. 

18.15.1 RESTRICTED BOLTZMANN MACHINES 

A restricted Boltzmann machine (RBM) is a special type of the more general class of Boltzmann 
machines, which were introduced in Chapter 15 [1,217]. Fig. 18.39 shows the probabilistic graphical 
model corresponding to an RBM. There are no connections among nodes of the same layer. Moreover, 
the upper level comprises nodes corresponding to hidden variables and the lower level consists of 
visible ones. That is, observations are applied to the nodes of the lower layer only. Deep RBMs can be 
constructed by stacking one on top of the other. 

FIGURE 18.39

An RBM is an undirected graphical model with no connections among nodes of the same layer. The lower level
comprises visible nodes and the upper layer consists of hidden nodes only.

Following the general definition of a Boltzmann machine, the joint distribution of the involved
random variables is of the form

$$P(v_1, \ldots, v_J, h_1, \ldots, h_I) = \frac{1}{Z} \exp\left(-E(v, h)\right), \tag{18.71}$$

where we have used different symbols for the $J$ visible ($v_j$, $j = 1, 2, \ldots, J$) and the $I$ hidden variables
($h_i$, $i = 1, 2, \ldots, I$). The energy is defined in terms of a set of unknown parameters (see footnote 6), that is,

$$E(v, h) = -\sum_{i=1}^{I}\sum_{j=1}^{J} \theta_{ij} h_i v_j - \sum_{i=1}^{I} b_i h_i - \sum_{j=1}^{J} c_j v_j. \tag{18.72}$$

The normalizing constant is obtained as

$$Z = \sum_{v}\sum_{h} \exp\left(-E(v, h)\right). \tag{18.73}$$

We will focus on discrete variables; hence the involved distributions are probabilities. More specifically,
we will focus on variables of a binary nature, that is, $v_j, h_i \in \{0, 1\}$, $j = 1, \ldots, J$, $i = 1, \ldots, I$.
Observe from Eq. (18.72) that, in contrast to a general Boltzmann machine, only products between
hidden and visible variables are present in the energy term.
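For a very small RBM, Eqs. (18.71)–(18.73) can be evaluated directly by enumerating all binary configurations. The sketch below does exactly that; it is meant only to make the definitions concrete, since the enumeration involves $2^{I+J}$ terms and is infeasible for realistic sizes. The parameter values are hypothetical.

```python
import math
from itertools import product

def energy(v, h, theta, b, c):
    """E(v, h) of Eq. (18.72); theta[i][j] couples hidden node i with visible node j."""
    interaction = sum(theta[i][j] * h[i] * v[j]
                      for i in range(len(h)) for j in range(len(v)))
    hidden_bias = sum(bi * hi for bi, hi in zip(b, h))
    visible_bias = sum(cj * vj for cj, vj in zip(c, v))
    return -interaction - hidden_bias - visible_bias

def joint_table(theta, b, c):
    """Return {(v, h): P(v, h)} and Z by brute-force enumeration, Eqs. (18.71) and (18.73)."""
    I, J = len(b), len(c)
    unnormalized = {(v, h): math.exp(-energy(v, h, theta, b, c))
                    for v in product((0, 1), repeat=J)
                    for h in product((0, 1), repeat=I)}
    Z = sum(unnormalized.values())
    return {vh: u / Z for vh, u in unnormalized.items()}, Z

# A toy RBM with I = 2 hidden and J = 3 visible nodes (hypothetical parameters).
theta = [[0.5, -1.0, 0.0], [1.5, 0.0, -0.5]]
b = [0.1, -0.2]
c = [0.0, 0.3, -0.3]
probs, Z = joint_table(theta, b, c)
```

With all parameters set to zero every configuration has zero energy, so that $Z = 2^{I+J}$ and the joint distribution is uniform.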

The goal in training an RBM is to learn the set of unknown parameters, $\theta_{ij}$, $b_i$, $c_j$, which will
be collectively denoted as $\Theta$, $b$, and $c$, respectively. A major path to this end is to maximize the
log-likelihood, using $N$ observations of the visible variables, denoted as $v_n$, $n = 1, 2, \ldots, N$, where

$$v_n := [v_{1n}, \ldots, v_{Jn}]^T$$

is the vector of the corresponding observations at time $n$. We will say that the visible nodes are clamped
on the respective observations. The corresponding (average) log-likelihood is given by

$$\begin{aligned}
L(\Theta, b, c) &= \frac{1}{N}\sum_{n=1}^{N} \ln P(v_n; \Theta, b, c) \\
&= \frac{1}{N}\sum_{n=1}^{N} \ln\left(\frac{1}{Z}\sum_{h}\exp\left(-E(v_n, h)\right)\right) \\
&= \frac{1}{N}\sum_{n=1}^{N} \ln\left(\sum_{h}\exp\left(-E(v_n, h; \Theta, b, c)\right)\right) - \ln\sum_{v}\sum_{h}\exp\left(-E(v, h)\right),
\end{aligned}$$

where the index $n$ in the energy refers to the respective observations onto which the visible nodes have
been clamped, and $\Theta$ has explicitly been brought into the notation.

6 Compared to the notation used in Section 15.4.2 we use a negative sign. This is only to suit better the needs of the section,
and it is obviously of no importance for the derivations.

Taking the derivative of $L(\Theta, b, c)$ with respect to $\theta_{ij}$ (similar is the case of the derivatives with
respect to $b_i$ and $c_j$) and applying standard properties of derivatives, it is not difficult to show
(Problem 18.13) that

$$\frac{\partial L(\Theta, b, c)}{\partial \theta_{ij}} = \frac{1}{N}\sum_{n=1}^{N}\sum_{h} P(h|v_n)\, h_i v_{jn} - \sum_{v}\sum_{h} P(v, h)\, h_i v_j, \tag{18.74}$$

where the following has been used:

$$P(h|v) = \frac{P(v, h)}{\sum_{h'} P(v, h')}.$$

The gradient in (18.74) involves two terms. The first one can be computed once $P(h|v)$ is available.
Basically, this term is the mean firing rate or correlation when the RBM is operating in its clamped phase;
often, we refer to it as the positive phase, and the term is denoted as $\langle h_i v_j \rangle^{+}$. The second term is the
corresponding correlation when the RBM is working in its so-called free running or negative phase,
and it is denoted as $\langle h_i v_j \rangle^{-}$. Thus, a gradient ascent scheme for maximizing the log-likelihood will
be of the form

$$\theta_{ij}(\text{new}) = \theta_{ij}(\text{old}) + \mu\left(\langle h_i v_j \rangle^{+} - \langle h_i v_j \rangle^{-}\right).$$
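For a toy RBM, both correlation terms can be computed exactly, which makes the gradient ascent step concrete. In the sketch below the positive phase uses the fact that, for binary units and the energy of Eq. (18.72), $P(h_i = 1|v)$ is a logistic sigmoid of $b_i + \sum_j \theta_{ij} v_j$, while the negative phase is obtained by brute-force enumeration over all configurations. The enumeration is feasible only for very small models and is used here in place of the sampling-based approximations needed in practice; the data and the step size are hypothetical.

```python
import math
from itertools import product

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def neg_energy(v, h, theta, b, c):
    """-E(v, h), with E as in Eq. (18.72)."""
    return (sum(theta[i][j] * h[i] * v[j] for i in range(len(h)) for j in range(len(v)))
            + sum(bi * hi for bi, hi in zip(b, h))
            + sum(cj * vj for cj, vj in zip(c, v)))

def all_states(n):
    return list(product((0, 1), repeat=n))

def gradient_theta(data, theta, b, c):
    """Exact gradient of the average log-likelihood with respect to theta."""
    I, J = len(b), len(c)
    # Positive (clamped) phase: average of P(h_i = 1 | v_n) * v_jn over the data.
    pos = [[0.0] * J for _ in range(I)]
    for v in data:
        for i in range(I):
            p_hi = sigmoid(b[i] + sum(theta[i][j] * v[j] for j in range(J)))
            for j in range(J):
                pos[i][j] += p_hi * v[j] / len(data)
    # Negative (free running) phase: expectation of h_i * v_j under the joint P(v, h).
    weights = {(v, h): math.exp(neg_energy(v, h, theta, b, c))
               for v in all_states(J) for h in all_states(I)}
    Z = sum(weights.values())
    neg = [[sum(w * h[i] * v[j] for (v, h), w in weights.items()) / Z
            for j in range(J)] for i in range(I)]
    return [[pos[i][j] - neg[i][j] for j in range(J)] for i in range(I)]

def log_likelihood(data, theta, b, c):
    """Average log-likelihood of the clamped observations."""
    I, J = len(b), len(c)
    log_Z = math.log(sum(math.exp(neg_energy(v, h, theta, b, c))
                         for v in all_states(J) for h in all_states(I)))
    clamped = [math.log(sum(math.exp(neg_energy(v, h, theta, b, c))
                            for h in all_states(I))) for v in data]
    return sum(clamped) / len(data) - log_Z

data = [(1, 1), (1, 1), (1, 0), (1, 1)]        # N = 4 observations, J = 2
theta, b, c = [[0.0, 0.0]], [0.0], [0.0, 0.0]  # I = 1 hidden node
mu = 0.5                                        # step size (hypothetical value)
grad = gradient_theta(data, theta, b, c)
theta_new = [[theta[i][j] + mu * grad[i][j] for j in range(2)] for i in range(1)]
```

Starting from zero parameters, the positive correlations exceed the negative ones for both visible nodes, so both weights increase and the log-likelihood of the data improves after the step.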

Let us take a minute to justify why we have named the two phases of operation as positive and
negative, respectively. These terms appear in the seminal papers on Boltzmann machines by Hinton
and Sejnowski [81,83]. The first one, corresponding to the clamped condition, can be thought of as a
form of a Hebbian learning rule. Hebb was a neurobiologist and stated the first ever (to the best of my
knowledge) learning rule [78]: “If two neurons on either side of a synapse are activated simultaneously,
the strength of this synapse is selectively increased.” Note that this is exactly the effect of the positive
phase correlation in the parameter’s update recursion. On the contrary, the effect of the negative phase
correlation term is the opposite. Thus, the latter term can be thought of as a forgetting or unlearning
contribution; it can be considered as a control condition of a purely “internal” nature (note that it does
not depend on the observations), compared to the “external” information received from the environment
(observations).

Details concerning the optimization of the log-likelihood as well as the finally derived algorithm 
can be obtained from the site of the book under the Additional Material part that is associated with the 
current chapter.

18.15.2 PRETRAINING DEEP FEED-FORWARD NETWORKS

This subsection presents the rationale of employing RBMs to pretrain a deep feed-forward neural network.
This concept was first presented in [86]. Although pretraining is no longer widely used, it can still offer
advantages in cases where labeled data are not enough to train a big neural network.

Fig. 18.40 presents a block diagram of a deep neural network with three hidden layers. The vector of
the input random variables is denoted as $x$ and those associated with the hidden ones as $h^k$, $k = 1, 2, 3$.
The vector of the output nodes is denoted as $y$. Pretraining evolves in a sequential fashion, starting
from the weights connecting the input nodes to the nodes of the first hidden layer. This can be achieved
by maximizing the likelihood of the observed samples of the input, $x$, and treating the
variables associated with the first layer as hidden ones. Once the weights corresponding to the first
layer have been computed, the respective nodes are allowed to fire an output value and a vector of
values, $h^1$, is formed. This is the reason that a generative model (such as the RBM) is adopted for the
unsupervised pretraining: to be able to generate, in a probabilistic way, outputs at the hidden nodes.
These values are in turn used as observations for the pretraining of the next hidden layer, and so on.

Once pretraining has been completed, a supervised learning rule, such as backpropagation, is then 
employed to obtain the values of weights leading to the output nodes, as well as to fine-tune the weights 
associated with the hidden layers, using as initial weight values those obtained during the pretraining 
phase. 



[Figure: block diagram with stages labeled, from bottom to top, Phase 1, Phase 2, Phase 3, and Output.]

FIGURE 18.40

Block diagram of a deep neural network architecture, with three hidden layers and one output layer. The vector of
the input random variables at the input layer is denoted as $x$. The vector of the variables associated with the nodes
of the $k$th hidden layer is denoted as $h^k$, $k = 1, 2, 3$. The output variables are denoted as $y$. At each phase of the
pretraining, the weights associated with one hidden layer are computed, one at a time. For the network of the figure,
comprising three hidden layers, pretraining consists of three stages of unsupervised learning. Once pretraining of
the hidden units has been completed, the weights associated with the output nodes are pretrained via a supervised
learning algorithm. During the final fine-tuning, all the parameters are estimated via a supervised learning rule, such
as the backpropagation scheme, using as initial values those obtained during pretraining.

As already stated, unsupervised learning is a way to discover and unveil information hidden
in the data, by learning the underlying regularities and the statistical structure of the data. In this way,
pretraining can be thought of as a data dependent regularizer that pushes the unknown parameters to
regions where good solutions exist, by exploiting the extra information acquired by the unsupervised
learning (see, for example, [47]).

Fig. 18.40 illustrates a multilayer perceptron with three hidden layers. As is always the case with
any supervised learning task, the kick-off point is a set of training examples, $(y_n, x_n)$, $n = 1, 2, \ldots, N$.
Training a deep multilayer perceptron, employing what we have said before, involves two major phases:
(a) pretraining and (b) supervised fine-tuning. Pretraining the weights associated with hidden nodes
involves unsupervised learning via the RBM rationale. Assuming $K$ hidden layers, $h^k$, $k = 1, 2, \ldots, K$,
we look at them in pairs, that is, $(h^{k-1}, h^k)$, $k = 1, 2, \ldots, K$, with $h^0 := x$ being the input layer. Each
pair will be treated as an RBM, in a hierarchical manner, with the outputs of the previous one becoming
the inputs to the next. It can be shown (for example, [86]) that adding a new layer each time increases
a variational lower bound on the log-probability of the training data.
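The greedy, layer-by-layer procedure can be sketched as follows. Because the training algorithm for a single RBM is not given in this section, the sketch substitutes one-step contrastive divergence (CD-1), a commonly used approximation of the log-likelihood gradient; this choice, the learning rate, the number of epochs, and all function names are assumptions of the example and not necessarily the scheme used in [86].

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def hidden_probs(v, W, b):
    """P(h_i = 1 | v) for every hidden node; W[i][j] couples hidden i with visible j."""
    return [sigmoid(b[i] + sum(W[i][j] * v[j] for j in range(len(v))))
            for i in range(len(b))]

def visible_probs(h, W, c):
    """P(v_j = 1 | h) for every visible node."""
    return [sigmoid(c[j] + sum(W[i][j] * h[i] for i in range(len(h))))
            for j in range(len(c))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm_cd1(data, n_hidden, rng, mu=0.1, epochs=20):
    """Train one RBM with CD-1 (an assumed stand-in for the exact gradient)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
    b, c = [0.0] * n_hidden, [0.0] * n_visible
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, b)                # positive phase (clamped)
            v1 = sample(visible_probs(sample(ph0, rng), W, c), rng)
            ph1 = hidden_probs(v1, W, b)                # one-step reconstruction
            for i in range(n_hidden):
                for j in range(n_visible):
                    W[i][j] += mu * (ph0[i] * v0[j] - ph1[i] * v1[j])
                b[i] += mu * (ph0[i] - ph1[i])
            for j in range(n_visible):
                c[j] += mu * (v0[j] - v1[j])
    return W, b, c

def pretrain_stack(data, layer_sizes, seed=0):
    """Treat each pair (h^{k-1}, h^k) as an RBM, with h^0 := x, from the bottom up."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, b, _ = train_rbm_cd1(inputs, n_hidden, rng)
        layers.append((W, b))
        # The trained nodes "fire": sampled values become the observations
        # on which the next RBM is clamped.
        inputs = [sample(hidden_probs(v, W, b), rng) for v in inputs]
    return layers

data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
layers = pretrain_stack(data, layer_sizes=[3, 2])
```

The returned weights and hidden biases would then serve as initial values for the supervised fine-tuning stage.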

Pretraining of the weights leading to the output nodes is performed via a supervised learning
algorithm. The last hidden layer together with the output layer is not treated as an RBM, but as a one-layer
feed-forward network. In other words, the inputs to this supervised learning task are the features formed
in the last hidden layer.

Finally, fine-tuning involves retraining in a typical backpropagation algorithm rationale, using the 
values obtained during pretraining for initialization. This is very important for getting a better feeling 
and understanding of how deep learning works. The label information is used in the hidden layers only 
at the fine-tuning stage. During pretraining, the feature values in each layer grasp information related 
to the input distribution and the underlying regularities. The label information does not participate 
in the process of discovering the features. Most of this part is left to the unsupervised phase, during 
pretraining. Note that this type of learning can also work even if some of the data are unlabeled. 
Unlabeled information is useful, because it provides valuable extra information concerning the input 
data. As a matter of fact, this is at the heart of semisupervised learning (see, e.g., [231]).

More details on RBM-based pretraining can be found on the site of the book under the Additional 
Material part associated with the current chapter (see also [88]). 

18.15.3 DEEP BELIEF NETWORKS 

The emphasis given in this chapter, so far, was on discussing feed-forward architectures. Our focus 
was on the information flow in the feed-forward or bottom-up direction. However, this is only part of 
the whole story. The other part concerns training generative models. The goal of such learning tasks 
is to “teach” the model to generate data. One way to achieve this is via learning probabilistic models 
that relate a set of variables, which can be observed, with another set of hidden ones. RBMs are just 
an instance of such models. Moreover, it has to be emphasized that RBMs can represent any discrete 
distribution if enough hidden units are used [52,137]. 

In our discussion up to now in this section, we viewed a deep network as a mechanism forming
layer-by-layer features of features, that is, progressively higher-level representations of the input
data. The issue now becomes whether one can start from the last layer, corresponding to the highest
level representation, and follow a top-down path with the new goal of generating data. Besides the
need in some practical applications, there is an additional reason to look at this reverse direction of
information flow.

Some studies suggest that such top-down connections exist in our visual system to generate lower-level
features of images starting from higher-level representations. Such a mechanism can explain the
creation of vivid imagery during dreaming, as well as the disambiguating effect on the interpretation
of local image regions by providing contextual prior information from previous frames (for example,
[134,135,165]).

A popular way to represent statistical generative models is via the use of probabilistic graphical
models, which were treated in Chapters 15 and 16. A typical example of a generative model is that of
sigmoidal networks, introduced in Section 15.3.4, which belong to the family of parametric Bayesian
(belief) networks. A sigmoidal network is illustrated in Fig. 18.41A, where a directed acyclic graph
(Bayesian) is shown. Following the theory developed in Chapter 15, the joint probability of the
observed ($x$) and hidden variables, distributed in $K$ layers, is given by

$$P(x, h^1, \ldots, h^K) = P(x|h^1)\left(\prod_{k=1}^{K-1} P(h^k|h^{k+1})\right) P(h^K),$$

where the conditionals for each one of the $I_k$ nodes of the $k$th layer are defined as

$$P(h_i^k | h^{k+1}) = \sigma\left(\sum_{j=1}^{I_{k+1}} \theta_{ij}^{k+1} h_j^{k+1}\right), \quad k = 1, 2, \ldots, K - 1, \; i = 1, 2, \ldots, I_k.$$

A variant of the sigmoidal network was proposed in [86], which has become known as the deep belief
network. The difference with a sigmoidal one is that the top two layers comprise an RBM. Thus, it is a
mixed type of network consisting of both directed and undirected edges. The corresponding graphical
model is shown in Fig. 18.41B. The respective joint probability of all the involved variables is given
by

$$P(x, h^1, \ldots, h^K) = P(x|h^1)\left(\prod_{k=1}^{K-2} P(h^k|h^{k+1})\right) P(h^{K-1}, h^K). \tag{18.75}$$

FIGURE 18.41

(A) A graphical model corresponding to a sigmoidal belief (Bayesian) network. (B) A graphical model
corresponding to a deep belief network. It is a mixture of directed and undirected edges connecting nodes. The top layer
involves undirected connections and it corresponds to an RBM.

It is known that learning Bayesian networks of relatively large size is intractable, because of the
presence of converging edges (see Section 15.3.3). To this end, variational approximation methods can
be mobilized to bypass this obstacle (see Section 16.3).

In [86], an alternative path was proposed following a scheme similar to that used before for pretrain- 
ing neural networks. In other words, all hidden layers, starting from the input one, are treated as RBMs, 
and a greedy layer-by-layer pretraining bottom-up philosophy is adopted. It should be emphasized that 
the conditionals, which are recovered by such a scheme, can only be thought of as approximations of 
the true ones. After all, the original graph is a directed one and not undirected, as the RBM assumption 
imposes. The only exception lies at the top level, where the RBM assumption is a valid one. 

Once the bottom-up pass has been completed, the estimated values of the unknown parameters
are used for initializing another fine-tuning training algorithm; such a scheme has been developed in
[85] for training sigmoidal networks and is known as the wake-sleep algorithm. The objective behind the
wake-sleep scheme is to adjust the weights during the top-down pass, so as to maximize the probability
of the network to generate the observed data. The scheme has a variational approximation flavor, and
if initialized randomly it takes a long time to converge. However, using the values obtained from the
pretraining for initialization, the process can be sped up significantly [89].

Once training of the weights has been completed, data generation is achieved by the scheme sum- 
marized in Algorithm 18.4. 

Algorithm 18.4 (Generating samples via a DBN). 

• Obtain samples $h^{K-1}$ for the nodes at level $K - 1$. This can be done via running a Gibbs chain, by
alternating samples, $h^{K} \sim P(h|h^{K-1})$ and $h^{K-1} \sim P(h|h^{K})$. This can be carried out in analogy to
the technique used to train RBMs (see the Additional Material for the current chapter on the book’s
site), as the top two layers comprise an RBM. The convergence of the Gibbs chain can be sped
up by initializing the chain with a feature vector formed at the $(K - 1)$th layer by one of the input
patterns; this can be done by following a bottom-up pass to generate features in the hidden layers,
as the one used during pretraining.

• For $k = K - 1, \ldots, 1$, Do; Top-down pass.

- For $i = 1, 2, \ldots, I_{k-1}$, Do; $I_0$ denotes the number of input nodes.

* $h_i^{k-1} \sim P(h_i | h^{k})$; Sample for each one of the nodes.

- End For

• End For

• $x = h^{0}$; Generated pattern.
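The generation procedure can be sketched as follows. It is a minimal illustration under stated assumptions: the weights are drawn at random instead of being trained, bias terms are included in the sigmoidal conditionals (set to zero in the example), and the data structures `top_rbm` and `down_layers` are hypothetical containers introduced only here.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_layer(inputs, W, bias, rng):
    """Sample binary nodes with P(node_i = 1) = sigmoid(bias_i + sum_j W[i][j] * inputs[j])."""
    out = []
    for w_row, b_i in zip(W, bias):
        p = sigmoid(b_i + sum(w * x for w, x in zip(w_row, inputs)))
        out.append(1 if rng.random() < p else 0)
    return out

def transpose(W):
    return [list(col) for col in zip(*W)]

def generate(top_rbm, down_layers, h_init, n_gibbs, rng):
    """Draw one pattern x = h^0 from a DBN.

    top_rbm     : (W, bias_K, bias_Km1); W[i][j] couples node i of layer K
                  with node j of layer K-1 (undirected edges).
    down_layers : list of (W, bias), ordered from layer K-1 -> K-2 down to 1 -> 0.
    h_init      : initial state of layer K-1 for the Gibbs chain.
    """
    W_top, bias_K, bias_Km1 = top_rbm
    h_below = list(h_init)
    for _ in range(n_gibbs):
        h_top = sample_layer(h_below, W_top, bias_K, rng)                # h^K ~ P(h | h^{K-1})
        h_below = sample_layer(h_top, transpose(W_top), bias_Km1, rng)   # h^{K-1} ~ P(h | h^K)
    h = h_below
    for W_k, bias_k in down_layers:          # top-down pass through the directed layers
        h = sample_layer(h, W_k, bias_k, rng)
    return h                                 # generated pattern

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# K = 3 hidden layers: x has 4 nodes, h^1 has 3, h^2 has 2, and h^3 has 2.
top_rbm = (rand_matrix(2, 2), [0.0, 0.0], [0.0, 0.0])
down_layers = [(rand_matrix(3, 2), [0.0] * 3),    # h^2 -> h^1
               (rand_matrix(4, 3), [0.0] * 4)]    # h^1 -> x
x = generate(top_rbm, down_layers, h_init=[1, 0], n_gibbs=100, rng=rng)
```

In practice, `h_init` would be obtained by a bottom-up pass of an input pattern, which, as noted in the algorithm, can speed up the convergence of the Gibbs chain.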


18.15.4 AUTOENCODERS 

Autoencoders have been proposed in [9,199] as methods for dimensionality reduction. An autoencoder 
consists of two parts, the encoder and the decoder. The output of the encoder is the reduced represen- 
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tation of the input pattern, and it is defined in terms of a vector function, 

f : x eR 1 h eR m , 


(18.76) 


where 

hi := ft(x) = <p e (0f x + b{), i — 1,2,... ,m, (18.77) 

with <p e being the activation function; the latter is usually taken to be the logistic sigmoid function, 
<j) e (-) = a(■). In other words, the encoder is a single hidden layer feed-forward neural network. 

The decoder is another function g. 


g:heR m \—+xeR l , (18.78) 

where 

xj = gj (h ) = cpclW/h + b)), 7 = 1,2,...,/. (18.79) 

The activation $\phi_d$ is usually taken to be either the identity (linear reconstruction) or the logistic sigmoid
one. The task of training is to estimate the parameters

$\Theta := [\theta_1, \ldots, \theta_m]$, $b$, $\Theta' := [\theta'_1, \ldots, \theta'_l]$, $b'$.

It is common to assume that $\Theta' = \Theta^T$. The parameters are estimated so that the reconstruction error,
$e = x - \hat{x}$, over the available input samples is minimum in some sense. Usually, the least-squares cost
is employed, but other choices are also possible. Regularized versions, involving a norm of the parameters,
are also a possibility (for example, [193]). If the activation $\phi_e$ is chosen to be the identity (linear
representation) and $m < l$ (to avoid triviality), the autoencoder is equivalent to the PCA technique [9].
PCA is treated in more detail in Chapter 19.
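The encoder and decoder maps of Eqs. (18.77) and (18.79) can be sketched in a few lines. The following is only an illustrative forward pass, assuming tied weights ($\Theta' = \Theta^T$), a sigmoid encoder, and an identity decoder; the function name, the dimensions, and the random initialization are hypothetical choices, and no training is performed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_forward(X, Theta, b, b_prime):
    """X: (N, l) inputs, one pattern per row. Theta: (l, m), columns are theta_i.
    Tied weights are assumed, i.e., Theta' = Theta^T."""
    H = sigmoid(X @ Theta + b)       # encoder: h_i = sigma(theta_i^T x + b_i)
    X_hat = H @ Theta.T + b_prime    # decoder with identity activation
    return H, X_hat

rng = np.random.default_rng(0)
l, m, N = 8, 3, 5                    # illustrative sizes, with m < l
X = rng.normal(size=(N, l))
Theta = 0.1 * rng.normal(size=(l, m))
b, b_prime = np.zeros(m), np.zeros(l)

H, X_hat = autoencoder_forward(X, Theta, b, b_prime)
cost = 0.5 * np.sum((X - X_hat) ** 2)   # least-squares reconstruction cost
```

Training would amount to minimizing `cost` with respect to `Theta`, `b`, and `b_prime` via any gradient-based scheme.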

Another version of autoencoders results if during training one adds noise to the input [238,239]. 
This is a stochastic counterpart, known as the denoising autoencoder. For reconstruction, the uncor- 
rupted input is employed. The idea behind this version is that by trying to undo the effect of noise, one 
captures statistical dependencies between inputs. More specifically, in [238], the corruption process 
randomly sets some of the inputs (as many as half of them) to zero. Hence, the denoising autoencoder 
is forced to predict the missing values from the nonmissing ones, for randomly selected subsets of 
missing patterns. 
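The corruption step can be illustrated as follows. This is a minimal sketch that assumes each input entry is zeroed independently with a given probability, which is one possible way to realize the masking described above; the helper name `corrupt` and the data are hypothetical.

```python
import numpy as np

def corrupt(X, drop_prob, rng):
    """Set each entry of X to zero independently with probability drop_prob."""
    keep = rng.random(X.shape) >= drop_prob
    return X * keep

rng = np.random.default_rng(0)
X = rng.random((4, 10)) + 1.0            # strictly positive toy inputs
X_noisy = corrupt(X, drop_prob=0.5, rng=rng)
# The encoder is fed with X_noisy, while the reconstruction
# cost is measured against the uncorrupted X.
```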

Autoencoders have also been used for pretraining deep networks, in place of the RBMs discussed 
before (see, e.g., [87]). In the latter, autoencoders with many layers, instead of a single one, have been 
employed. 

18.15.5 GENERATIVE ADVERSARIAL NETWORKS 

In Sections 18.15.1 and 18.15.3, we considered generative probabilistic graphical models for probability
distribution estimation as well as for data generation. The major drawback of such models comes
from the computational intractability when maximizing the associated likelihood function and related
costs in order to compute the unknown parameters that define the graphical model.

An alternative path for designing generative models was first described in [61]. The breakthrough
concept was to abandon the idea of modeling the probability distribution directly; instead, a game-theoretic
scenario was adopted, where the generator network was left to compete against an adversary.
The essence behind the method is to “make” the generator produce examples that are indistinguishable
from the available observations, which are used for training. Fig. 18.42 illustrates the main idea
of such networks, known as generative adversarial networks (GANs). In the figure, there are two networks,
each of them being a deep neural network, i.e., the generator (G) and the discriminator (D)
network.



FIGURE 18.42 

One class comprises the real images and the other the fake ones. The latter are produced by the generator, which
is excited by noise. The output of the discriminator represents the probability of an input pattern to originate from
the real data class. If the classes were perfectly separable, the discriminator’s output, $y$, would be equal to 1 for
real images and 0 for the fake ones. The goal of training a GAN is to “confuse” the discriminator so that its output
becomes 1/2 both for the real and for the fake ones.

The input of the generator is fed with noise samples drawn from a probability distribution $p_z(z)$,
which is typically a uniform or a Gaussian PDF. The generator transforms the input noise vector, $z$, into
a sample $x = G(z; \theta_g)$ of the same dimension as that of the available observations, $x_n$, $n = 1, 2, \ldots, N$.
The parameter vector $\theta_g$ comprises all the parameters that define the generator neural network and
these parameters have to be estimated during the training phase. The goal of the training is to compute
the parameters so that the generated patterns, $x$, are statistically indistinguishable from the available
observations. From now on, we will refer to the observations as real and those that are produced by the
generator as fake data.

The discriminator is a binary classifier that is implemented as a deep neural network. Its input is fed
with samples $x$ and it outputs a corresponding probability value, $y = D(x; \theta_d)$. The parameter vector
$\theta_d$ comprises all the parameters that define the associated neural network and are learned during training.
The output value, $y$, represents the probability that the corresponding input pattern, $x$, originates
from the observations rather than from the generator. In other words, $D(x; \theta_d)$ is the probability of $x$
being real, and $1 - D(x; \theta_d)$ is the probability of $x$ being fake. If the discriminator were designed to
make perfect decisions, then the output should be 1 for the real data and 0 for the fake ones.

7 Note that the notion of adversary here is used in a different context than that of the adversarial examples, discussed in Section 18.14.

During training, the parameter vectors are optimally estimated so that the discriminator confuses
the real with the fake data points. To this end, in [61], the following two-player minmax game value
(cost) is adopted:

$\min_{\theta_g} \max_{\theta_d} J(\theta_g, \theta_d)$, (18.80)

where

$J(\theta_g, \theta_d) = \mathbb{E}_{x \sim p_r(x)}[\ln D(x; \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\ln(1 - D(G(z; \theta_g); \theta_d))]$, (18.81)

and $p_r(x)$ denotes the probability distribution associated with the real data (subscript $r$ reminds us of
real).

In words, if $\theta_g$ is kept fixed, then the discriminator is trained, via $\theta_d$, so that its output is maximized
both for real ($D(x; \theta_d)$) and for fake ($1 - D(G(z; \theta_g); \theta_d)$) examples. Furthermore, keeping $\theta_d$
fixed, $G$ is simultaneously trained through $\theta_g$ to minimize $1 - D(G(z; \theta_g); \theta_d)$, in order to confuse the
discriminator.

Algorithm 18.5 summarizes the main concept, adopting a minibatch (of size $K$) optimization rationale.

Algorithm 18.5 (The GAN algorithm). 

• Initialize $\theta_d^{(0)}$ and $\theta_g^{(0)}$.

• Set the minibatch size equal to $K$.

• Set the number of iterations, $m$, for the discriminator; the simplest case is $m = 1$, and it is used in
the original paper.

• While $\theta_d$ and $\theta_g$ have not converged, Do

- For $t = 1, 2, \ldots, m$, Do

* Sample $z^{(i)}$, $i = 1, 2, \ldots, K$, from $p_z(z)$

* Sample $x^{(i)}$, $i = 1, 2, \ldots, K$, from $p_r(x)$

* Compute the gradient

$\nabla_{\theta_d} \frac{1}{K} \sum_{i=1}^{K} \left[ \ln D(x^{(i)}; \theta_d) + \ln\left(1 - D(G(z^{(i)}; \theta_g); \theta_d)\right) \right]$

* Update $\theta_d$ via a gradient ascent scheme

$\theta_d \longleftarrow \theta_d$

- End For

• Sample $z^{(i)}$, $i = 1, 2, \ldots, K$, from $p_z(z)$

• Compute the gradient

$\nabla_{\theta_g} \frac{1}{K} \sum_{i=1}^{K} \ln\left(1 - D(G(z^{(i)}; \theta_g); \theta_d)\right)$

• Update $\theta_g$ via a gradient descent scheme

$\theta_g \longleftarrow \theta_g$

• End While
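The quantity whose gradient drives both updates is the minibatch estimate of $J$ in Eq. (18.81). The short sketch below evaluates it directly from discriminator outputs; the function name `gan_value` and the toy output values are illustrative assumptions, and no networks are involved.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Minibatch estimate of J in Eq. (18.81).
    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

K = 4
# A "confused" discriminator outputs 1/2 everywhere.
confused = gan_value(np.full(K, 0.5), np.full(K, 0.5))
# A confident discriminator separates the two classes almost perfectly.
confident = gan_value(np.full(K, 0.99), np.full(K, 0.01))
```

For the confused discriminator the estimate equals $\ln(1/2) + \ln(1/2) = -\ln 4$, while for the confident one it approaches zero from below.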

Remarks 18.8. 

• In [61], it is pointed out that rather than training $G$ to minimize $\ln(1 - D(G(z)))$, it is better to
train $G$ to maximize $\ln D(G(z))$. This choice yields stronger gradients early in training, when the
discriminator can easily reject the generated samples and $\ln(1 - D(G(z)))$ saturates. Further
theoretical justification for this modification, together with comparative experimental results, can
be found in [51].
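The effect on the gradients can be seen on a scalar toy case. Assume, for illustration only, that the discriminator output on a fake sample is $D(G(z)) = \sigma(a)$, where $a$ is the pre-activation of the output node. Then $\frac{d}{da}\ln(1 - \sigma(a)) = -\sigma(a)$, whereas $\frac{d}{da}\ln\sigma(a) = 1 - \sigma(a)$. The sketch below compares the two when the discriminator rejects the fake sample with high confidence.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """Derivative of ln(1 - sigmoid(a)) with respect to a."""
    return -sigmoid(a)

def grad_nonsaturating(a):
    """Derivative of ln(sigmoid(a)) with respect to a."""
    return 1.0 - sigmoid(a)

a = -8.0   # D(G(z)) = sigmoid(-8): the sample is confidently labeled as fake
g_sat = grad_saturating(a)
g_non = grad_nonsaturating(a)
```

In this regime the magnitude of `g_sat` is close to zero, while `g_non` is close to one, so the modified cost still passes a useful signal back to the generator.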


On the Optimality of the Solution

The first question that is now raised concerns the optimal solution associated with the minmax optimization
task defined by (18.80) and (18.81). Note that the generator $G$ implicitly defines a probability distribution, $p_g(x)$,
of the samples $x = G(z; \theta_g)$ that are generated if $z \sim p_z(z)$.

For our analysis, we will free ourselves from the parametric modeling of the involved functions via
$\theta_g$ and $\theta_d$, and we will study the optimal solution in terms of functions $D(x)$ and $G(z)$, considered in
a general nonparametric formulation. Under such a scenario, Eq. (18.81) is rephrased as

$\min_{G} \max_{D} J(G, D) := \mathbb{E}_{x \sim p_r(x)}[\ln D(x)] + \mathbb{E}_{z \sim p_z(z)}[\ln(1 - D(G(z)))]$. (18.82)


By the definition of the expectation we can write

$J(G, D) = \int p_r(x) \ln(D(x))\, dx + \int p_z(z) \ln(1 - D(G(z)))\, dz$

$\qquad\quad\; = \int \left[ p_r(x) \ln(D(x)) + p_g(x) \ln(1 - D(x)) \right] dx$,

where the subscript $g$ stands for the distribution associated with the generator output. Fixing function
$G$ (equivalently, fixing the probability distribution function $p_g$), the optimal value of the discriminator’s
output probability function $D$ is easily derived by taking the derivative of the integrand with respect to
the function $D$ and setting it equal to zero (Problem 18.15). It is easily shown that

$D^* = \frac{p_r}{p_r + p_g}$. (18.83)

Thus, plugging the above optimal value in (18.82), the cost with respect to $G$ of the minmax game is
reformulated as

$C(G) := \mathbb{E}_{x \sim p_r(x)}\left[ \ln \frac{p_r(x)}{p_r(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g(x)}\left[ \ln \frac{p_g(x)}{p_r(x) + p_g(x)} \right]$, (18.84)
where the above is a function of $p_g$. Recalling the definition of the KL divergence (Sections 2.5.2 and
12.7), Eq. (18.84) is compactly rewritten as

$C(G) = \mathrm{KL}(p_r \| p_r + p_g) + \mathrm{KL}(p_g \| p_r + p_g)$,

where, as we know, the KL divergence is not symmetric with respect to its arguments. Normalizing
by 2, and taking into account the definition of the KL divergence, the above is easily rewritten as

$C(G) = -\ln 4 + \mathrm{KL}\left(p_r \,\Big\|\, \frac{p_r + p_g}{2}\right) + \mathrm{KL}\left(p_g \,\Big\|\, \frac{p_r + p_g}{2}\right)$,

or

$C(G) = -\ln 4 + 2\,\mathrm{JS}(p_r \| p_g)$, (18.85)

where JS denotes the Jensen-Shannon divergence between $p_r$ and $p_g$, defined as

$\mathrm{JS}(p_r \| p_g) := \frac{1}{2} \mathrm{KL}\left(p_r \,\Big\|\, \frac{p_r + p_g}{2}\right) + \frac{1}{2} \mathrm{KL}\left(p_g \,\Big\|\, \frac{p_r + p_g}{2}\right)$.

Thus, as Eq. (18.85) suggests, the cost $C(G)$ associated with the generator depends on the JS divergence
between the $p_r$ of the real data and the $p_g$ that the generator implements. It can easily be checked
that the JS divergence, in contrast to KL divergence, is a symmetric one. As conjectured in [232], it is
exactly the use of this divergence, compared to the KL one, that gave GANs the advantage over more
traditional maximum likelihood-based approaches.

It can easily be seen that the JS divergence is nonnegative and the minimum value is achieved if
and only if $p_r = p_g$. Hence, the solution of the minmax task defined in (18.82) is obtained when $D$ is
given by (18.83) and when $G$ minimizes (18.85), leading to

$p_g = p_r$, $D^* = \frac{1}{2}$, and $C^* = -\ln 4$.

In other words, optimality of the game is achieved when the generator learns the distribution of the true 
data and the discriminator D is “confused” and becomes unable to discriminate between the true and 
the fake data. 
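The identity between Eqs. (18.84) and (18.85) can be checked numerically. The sketch below does so for two discrete distributions, where expectations reduce to sums; the two probability vectors are arbitrary illustrative choices with strictly positive entries, and the function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with strictly positive entries."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence, as defined after Eq. (18.85)."""
    mix = 0.5 * (p + q)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def generator_cost(p_r, p_g):
    """C(G) of Eq. (18.84), evaluated with the optimal discriminator of Eq. (18.83)."""
    d_star = p_r / (p_r + p_g)
    return np.sum(p_r * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

p_r = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

lhs = generator_cost(p_r, p_g)               # Eq. (18.84)
rhs = -np.log(4.0) + 2.0 * js(p_r, p_g)      # Eq. (18.85)
c_opt = generator_cost(p_r, p_r)             # the case p_g = p_r
```

The two sides agree up to rounding, and `c_opt` equals $-\ln 4$, in line with the optimal solution stated above.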

Problems in Training GANs 

In the previous section, the optimality of the GANs has been established. Yet, the task is far from 
solved. The main reason is that the optimality analysis was carried out in the probability function 
space. However, in practice, we use parametric models to achieve an approximation to the above, and 
this is where problems arise. As a matter of fact, as Algorithm 18.5 suggests, in practice, optimization 
is carried out via parameter optimization and the involved gradients are computed via the backpropaga- 
tion rationale and updates follow one of the available gradient-based optimization schemes. However, 
updating the parameters of the generator and the discriminator concurrently, to solve a nonconvex 
optimization task, does not necessarily mean that the scheme converges to the optimal game solution. 
Looking at Eq. (18.85) and following the arguments on how we reached it, one would expect that
training the discriminator close to its optimality (so that the cost function on $\theta_g$ is a good approximation
of Eq. (18.85)) and then applying the gradient steps on $\theta_g$ is the way to proceed. However, in
practice, this is not the case. As the discriminator improves, the updates of the generator get worse. An
explanation of this phenomenon is provided in [5]. At the heart of this “strange” behavior lies the fact
that parameterized approximations of $p_r$ and $p_g$ lie in low-dimensional manifolds and this makes it
very difficult to share common supports (regions in the input space where both functions share nonzero
values). This is illustrated in Fig. 18.43. As a consequence, the trained discriminator, instead of achieving
a cost according to Eq. (18.85), achieves a zero error, meaning that it can perfectly distinguish
the fake from the true data. This drives the cost to zero and the gradients with respect to $\theta_g$ to vanishingly
small values, which makes the training of GANs, via the JS cost, difficult. While training, one has
to decide how much to train the discriminator. In other words, there is a tradeoff between inaccuracy
and vanishing gradients for training.

FIGURE 18.43

In the three-dimensional space, the two straight lines are low-dimensional (linear) manifolds. It is highly unlikely
that they intersect. So if, say, $p_r$ is confined to one and $p_g$ to the other of the two lines, it is
highly improbable for the two distributions to share a nonnegligible support.
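The saturation of the JS-based cost for nonoverlapping supports can be illustrated with a toy discrete case. In the sketch below, which is only an illustration under simplifying assumptions, $p_r$ is a point mass at one grid position and $p_g$ a point mass at another; the helper names and the grid size are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """JS divergence for discrete distributions; zero-mass terms contribute zero."""
    mix = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def point_mass(index, size=10):
    p = np.zeros(size)
    p[index] = 1.0
    return p

p_r = point_mass(0)
# Move the generator's point mass to positions 1, 2, ..., 9.
js_values = np.array([js_divergence(p_r, point_mass(k)) for k in range(1, 10)])
cost_values = -np.log(4.0) + 2.0 * js_values     # Eq. (18.85)
```

Every entry of `js_values` equals $\ln 2$ and every entry of `cost_values` equals zero, irrespective of how far apart the two point masses are. The cost therefore carries no information on the direction in which $p_g$ should move, which is the vanishing gradient effect discussed above.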

Another problem associated with training GANs is the so-called mode collapse. This means that
although the generator may “fool” the discriminator, it keeps producing nearly the “same” outputs. It seems
that the generator gets stuck in a “small” region, which leads to outputs of low variability. This may be
due to the fact that the generator cannot learn sufficiently well the data distribution, and it focuses on
some parts of it, while ignoring other parts.

In order to bypass the aforementioned drawbacks, a number of tricks and techniques have been 
proposed (see, e.g., [203,258]). In [188], convolutional generative networks are introduced. In [154], 
least-squares arguments are adopted for the design of GANs. In [140], the concept of combining models 
is employed based on a competitive training procedure that splits the data distribution into components, 
which are approximated well by independent generative models. 
The Wasserstein GAN

In [6], the task of matching $p_r$ and $p_g$ in low-dimensional manifolds is systematically tackled via
the so-called Wasserstein distance between distributions. It is first shown that when learning distributions
in low-dimensional manifolds, the divergence that measures the difference between distributions
must have certain properties. Neither the JS nor the KL divergence has these properties. In contrast, the
Wasserstein distance is a proper one in this context.

The Wasserstein distance, or earth moving distance (EMD), between two distributions $p_r$ and $p_g$,
is defined as

$W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[ \|x - y\| \right]$, Wasserstein distance. (18.86)

Although the previous definition may look a bit complicated to the unfamiliar reader, it makes a lot
of sense; $\Pi(p_r, p_g)$ is the set of all joint distributions between the random vectors, whose marginals
are equal to $p_r$ and $p_g$, respectively. Intuitively, the expectation indicates how much probability mass
has to be transferred from $x$ to $y$ in order to transform $p_r$ to $p_g$. Their distance corresponds to the
“minimum” value among all possible joint distributions.
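In one dimension, and for two empirical distributions with the same number of equally weighted samples, the infimum in Eq. (18.86) is attained by pairing the sorted samples, so the distance reduces to the mean absolute difference of the sorted values. The sketch below relies on this special case only; it is not a general optimal transport solver, and the function name and data are illustrative.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Wasserstein distance between two equal-size 1-D empirical distributions
    with uniform weights: pair the sorted samples and average the distances."""
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
shifts = [0.0, 0.5, 2.0, 10.0]
# Compare the samples with shifted copies of themselves.
distances = [wasserstein_1d(x, x + s) for s in shifts]
```

Here `distances` reproduces the shifts, i.e., the distance grows smoothly with the displacement between the two distributions, even when their supports hardly overlap.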

In practice, Eq. (18.86) cannot easily be implemented. An alternative way is via its dual formulation,
known as the Kantorovich-Rubinstein duality (e.g., [236]), given by

$W(p_r, p_g) = \sup_{\|f\|_L \le 1} \left\{ \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \right\}$, (18.87)

where $\|f\|_L \le 1$ denotes all the 1-Lipschitz functions (see also Section 8.10.2), $f : \mathcal{X} \longmapsto \mathbb{R}$, i.e.,

$|f(x_1) - f(x_2)| \le \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathcal{X}$.

Thus, the Wasserstein probability distance between two distributions, in the form of Eq. (18.87), is
given by the “maximum” in the difference of the mean values, over all possible 1-Lipschitz functions,
with respect to the two distributions.

Designing a GAN according to Eq. (18.87) makes the setting different from the one that evolved
around the cost in Eq. (18.81). The goal is no longer to design a binary classifier (discriminator) to
distinguish real from fake data, but a Lipschitz function, $f$. The latter is parameterized via a deep
neural network, in terms of a set of parameters, $\theta_f$, i.e., $f(x; \theta_f)$. The respective parameters are
estimated by solving the task

$\max_{\theta_f} \left\{ \mathbb{E}_{x \sim p_r}[f(x; \theta_f)] - \mathbb{E}_{z \sim p_z}[f(G(z; \theta_g); \theta_f)] \right\}$.

Fixing the parameters $\theta_f$, the parameters $\theta_g$ that define the generator are estimated by minimizing the
difference in the bracket above. Note that the gradient with respect to $\theta_g$ depends only on the second term, the first being
independent of $\theta_g$. As the loss function decreases during the training, the Wasserstein distance gets
smaller and the output of the generator gets closer to the real data distribution. The Wasserstein GAN
scheme is outlined in Algorithm 18.6.


For a nice discussion see, e.g., https://vincentherrmann.github.io/blog/wasserstein/. 
Algorithm 18.6 (The Wasserstein GAN algorithm).

• Initialize $\theta_f^{(0)}$ and $\theta_g^{(0)}$.

• Set the minibatch size equal to $K$.

• Set the number of iterations, $m$, for the function parameters $\theta_f$.

• Set the clipping parameter $c$; this parameter takes care of the Lipschitz condition. Typically, $c = 0.001$.

• While $\theta_f$ and $\theta_g$ have not converged, Do

- For $t = 1, 2, \ldots, m$, Do

* Sample $z^{(i)}$, $i = 1, 2, \ldots, K$, from $p_z(z)$

* Sample $x^{(i)}$, $i = 1, 2, \ldots, K$, from $p_r(x)$

* Compute the gradient

$\nabla_{\theta_f} \frac{1}{K} \sum_{i=1}^{K} \left( f(x^{(i)}; \theta_f) - f(G(z^{(i)}; \theta_g); \theta_f) \right)$

* Update $\theta_f$ via a gradient ascent scheme

$\theta_f \longleftarrow \theta_f$

* Clip the values in the interval $[-c, c]$

- End For

• Sample $z^{(i)}$, $i = 1, 2, \ldots, K$, from $p_z(z)$

• Compute the gradient

$\nabla_{\theta_g} \left\{ -\frac{1}{K} \sum_{i=1}^{K} f(G(z^{(i)}; \theta_g); \theta_f) \right\}$

• Update $\theta_g$ via a gradient descent scheme

$\theta_g \longleftarrow \theta_g$

• End While
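The inner loop of the algorithm above (gradient ascent on $\theta_f$ followed by clipping) can be illustrated on a deliberately simple case. The sketch assumes a linear function $f(x; w) = w^T x$ in place of a deep network, so that the gradient of the bracketed difference is available in closed form; the step size, the clipping value $c = 0.1$, and the Gaussian toy data are illustrative assumptions, and the fake samples merely stand in for the generator output.

```python
import numpy as np

def critic_step(w, x_real, x_fake, lr, c):
    """One gradient-ascent step for a linear f(x; w) = w @ x, followed by clipping."""
    # Gradient of (1/K) sum_i [f(x_i) - f(G(z_i))] with respect to w.
    grad = x_real.mean(axis=0) - x_fake.mean(axis=0)
    w = w + lr * grad                 # gradient ascent
    return np.clip(w, -c, c)          # keep every parameter within [-c, c]

rng = np.random.default_rng(0)
x_real = rng.normal(loc=2.0, size=(64, 2))
x_fake = rng.normal(loc=0.0, size=(64, 2))   # stands in for G(z; theta_g)

w = np.zeros(2)
for _ in range(50):
    w = critic_step(w, x_real, x_fake, lr=0.05, c=0.1)

# Estimated difference of mean values under the learned f.
gap = (x_real @ w).mean() - (x_fake @ w).mean()
```

Clipping bounds the parameters and, for this linear $f$, also bounds its Lipschitz constant; for deep networks the relation is much looser, which is one reason why clipping is regarded as a crude device.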

Remarks 18.9. 

• Note that since Algorithm 18.6 does not train a discriminator, the better the estimate of the function
$f$ is, the higher the quality of the gradient with respect to $\theta_g$ is expected to be. Hence, $m$ can be
given a fairly large value, without having to worry about balancing the discriminator and generator,
as was the case with Algorithm 18.5.

• In [6], it is reported that the Wasserstein GAN leads to improved stability and robustness compared
to previously developed schemes.

• However, as with everything in life, nothing is perfect. For example, clipping is a rather “crude” way
to enforce the Lipschitz condition. Improvements of the basic scheme have been proposed in, e.g.,
[248] and [70].
Which Algorithm Then

So far, in our discussion on GANs, we focused on two possible algorithms and we also provided 
references to a number of other alternatives. The Wasserstein GAN helped us to outline some important 
issues that underlie the task of learning distributions (implicitly or explicitly) where, most often, the 
learning process takes place in lower-dimensional manifolds. The reader and any practitioner may 
wonder which algorithm is “best” or more suitable to be used in a practical application. 

One of the major problems associated with data generation is to be able to test the performance 
of the generator by assessing the quality of the generated patterns. The main issue in evaluating the 
performance stems from the fact that one cannot explicitly compute the distribution p g (x). Thus, 
classical costs such as the likelihood function cannot be employed. Attempts to approximate it in high- 
dimensional spaces seem to be problematic in the current context (e.g., [103]). To this end, related 
metrics have been suggested. For example, the so-called inception score (IS) builds upon the concept
that a good model should generate samples for which, when evaluated by a pretrained classifier (an
inception network), the class distribution has low entropy. At the same time, the generated samples
should exhibit large variation [203]. On the other hand, the Fréchet inception distance focuses on a
metric that quantifies the difference between true and fake data samples, after a specific embedding in
a feature space via an inception network [80]. Assuming that the embedded data follow a multivariate
Gaussian distribution, the distance between the two distributions is quantified by the Fréchet distance
between the corresponding Gaussians.
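For two Gaussians, $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, the squared Fréchet distance has the closed form $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right)$. The sketch below evaluates this expression only; the embedding via an inception network is not shown, and the mean vectors and covariance matrices are assumed to have been estimated beforehand from the embedded real and fake samples. The function name is hypothetical.

```python
import numpy as np

def frechet_distance_sq(mu1, cov1, mu2, cov2):
    """Squared Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov1 @ cov2, which are real and nonnegative for
    # positive definite covariance matrices.
    eigvals = np.linalg.eigvals(cov1 @ cov2).real
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals, 0.0, None)))
    return diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
cov = A @ A.T + np.eye(3)            # a positive definite covariance matrix
mu = np.zeros(3)

d_same = frechet_distance_sq(mu, cov, mu, cov)
d_shift = frechet_distance_sq(mu, cov, mu + np.array([1.0, 2.0, 2.0]), cov)
```

For identical Gaussians the distance vanishes, and when only the means differ it reduces to the squared Euclidean distance between them (here, 9).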

A comparative experimental study on various GANs has been presented in [143]. The findings 
there, using a number of metrics, including the previously stated quality measuring ones, do not seem 
to favor any one of the proposed GANs against the others. It is reported that most models can achieve 
similar scores after careful hyperparameter tuning in optimization, higher computational budget, and 
random restarts. 

However, it should be stated that at the time this edition is compiled, designing GANs is an ongoing 
research area and it is rather early to come up with definite statements. 

Example 18.5. The purpose of this example is to describe a basic experiment on GANs to generate 
images of hand-written characters. To this end, the generator as well as the discriminator networks 
have to be designed. 

• The generator network has 100 input nodes that are excited by i.i.d. Gaussian noise samples of 
zero mean and unit variance. The network comprises three hidden layers, with 256, 512, and 1024 
neurons, respectively. The nonlinearity used for the neurons of the hidden layers was the leaky 
ReLU (Section 18.6.1), with a = 0.2. There are 784 output units and the output nonlinearity was 
the tanh function. The number of output units is chosen to be equal to the size of the MNIST image 
data set, which is used for training. 

• The discriminator has 784 input nodes, to match the 1 × 784 size vector associated with the 28 × 28
MNIST images. The discriminator consists of three hidden layers, of 1024, 512, and 256 neurons,
respectively. The leaky ReLU unit, with a = 0.2, was used as the activation function for the hidden
layer neurons. The output is a single binary sigmoidal node and the cost function was the two-class
cross-entropy function (Eq. (18.44)). For the training, the dropout method was used with the probability
of discarding hidden neurons being equal to 0.3. The Adam optimizer was employed for
training (Problem 18.21). The input images were normalized in the range [−1, 1]. The number of
images used for training was 60000 and the size of the minibatches equal to 100.

9 http://yann.lecun.com/exdb/mnist/.
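The layer sizes of the two networks described above can be checked with a plain forward pass. The sketch below uses randomly initialized, untrained weights and omits dropout and the Adam-based training; the helper names and the 0.02 initialization scale are illustrative assumptions, not the setup used to produce the reported results.

```python
import numpy as np

def leaky_relu(a, slope=0.2):
    return np.where(a > 0, a, slope * a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_mlp(sizes, rng):
    """Random (untrained) weights and zero biases for a fully connected network."""
    return [(0.02 * rng.normal(size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x, out_fn):
    for W, b in params[:-1]:
        x = leaky_relu(x @ W + b)     # hidden layers
    W, b = params[-1]
    return out_fn(x @ W + b)          # output layer

rng = np.random.default_rng(0)
gen = init_mlp([100, 256, 512, 1024, 784], rng)    # generator
disc = init_mlp([784, 1024, 512, 256, 1], rng)     # discriminator

z = rng.normal(size=(100, 100))      # a minibatch of 100 noise vectors
fake = forward(gen, z, np.tanh)      # outputs lie in [-1, 1], as the normalized images
y = forward(disc, fake, sigmoid)     # probability of each pattern being real
```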

Fig. 18.44 shows examples of generated images by the generator after 1, 20, and 400 epochs of training. 

Observe how the quality of the fake images improves as the training algorithm converges. 
FIGURE 18.44

Fake images produced at the output of the generator network after training for (A) 1, (B) 20, and (C) 400 epochs, 
respectively. 


18.15.6 VARIATIONAL AUTOENCODERS

At the heart of GANs lies the concept of generating random samples, x, by exciting the generator 
network with noise random samples, z. Variational autoencoders (VAEs) follow a similar concept; 
yet the generator, known as decoder in this case, is excited by random variables whose PDF has been 
learned from the data. The training builds upon Bayesian learning arguments (see Chapters 12 and 13). 
Variational autoencoders were proposed in [120,190]. 

The basic assumption behind VAEs is that we are given a set of observations, $x_n$, $n = 1, 2, \ldots, N$,
which are samples of a random vector, $x \in \mathbb{R}^l$. Moreover, the latter is the result of another process
that involves a set of continuous latent random variables, $z \in \mathbb{R}^m$, where usually $m \ll l$. That is, the
underlying generation mechanism evolves along the following steps:

• A sample $z_n$ is generated according to a prior PDF, $p(z; \theta)$, where $\theta$ is a set of unknown parameters.

• A sample $x_n$ is generated from a conditional PDF, $p(x|z_n; \theta)$.

Both $z_n$, $n = 1, 2, \ldots, N$, and $\theta$ are unknown and have to be learned from the available observations
$x_n$, $n = 1, 2, \ldots, N$. As we know from the Bayesian-related chapters, a way to estimate a set of parameters,
$\theta$, in the presence of latent (unobserved) variables is via the EM algorithm. A prerequisite for
its application is to know the posterior, $p(z|x)$; this is needed for averaging out the (unknown) latent
variables. In general, this posterior is not available (or the computations required by EM may not
be tractable) and one has to resort to approximations. In Chapter 13, a factorized form
was adopted in the framework of the mean field approximation. In contrast, according to the VAEs, the
approximation $q(z|x; \phi)$ of the posterior is implemented via the use of a neural network; the vector $\phi$
denotes the set of the unknown parameters that have to be learned together with $\theta$.

In the VAE jargon, $q(z|x; \phi)$ is known as the probabilistic encoder; indeed, given an observation
$x$, it produces a distribution (e.g., a Gaussian) over the possible values of $z$, which is also known as
the code. On the other hand, $p(x|z; \theta)$ is referred to as the probabilistic decoder; given a code $z$, it
produces a distribution over the possible corresponding values of $x$. The major difference with the
autoencoders considered in Section 18.15.4 is that in the variational autoencoder context, the encoder
is not designed to output a single value for each latent variable; in contrast, it is designed to provide a
probability distribution for each latent attribute.

The concept behind VAEs should be clear by now. Once we learn the parameters associated with the
approximate models of the posterior and the conditional, $\phi$ and $\theta$, then we can generate new samples.
Generate $z_n$ according to the posterior and then use the generated sample, together with the conditional,
to produce the corresponding $x_n$. All that remains for us now is (a) to adopt the explicit parametric
models for the conditional $p(x|z; \theta)$ and for the approximation to the posterior $q(z|x; \phi)$ and (b) to
establish the method for estimating the unknown parameters.

The parametric models: In [120], the following model is proposed:

• For the prior, the standard multivariate Gaussian is adopted, i.e., $p(z) = \mathcal{N}(z; 0, I)$. Thus, according
to such a choice, the prior does not depend on any parameters.

• For the posterior approximation $q(z|x; \phi)$, the multivariate Gaussian is also selected, with diagonal
covariance matrix ($\Sigma_n = \mathrm{diag}\{\sigma_n^2(i)\}_{i=1}^{m}$), i.e.,

$q(z|x_n; \phi) = \mathcal{N}(z; \mu_n, \Sigma_n)$.

The crucial point here is that the above mean values and variances are provided as outputs of the
encoding network; in the simplest case, a network of a single hidden layer is employed (deeper
networks and other variants can also be considered). More specifically,

$h_n = \tanh(W_1 x_n + b_1)$,
$\mu_n = W_2 h_n + b_2$,
$\ln \sigma_n^2 = W_3 h_n + b_3$,

where $\ln \sigma_n^2$ in the last equation denotes an element-wise action on the vector comprising the variances.
Thus, the parameters $\phi$ comprise the set of all the parameters that define the network as well
as the linear combiners, i.e., $W_1, W_2, W_3, b_1, b_2, b_3$, which are matrices and bias vectors of appropriate
dimensions. Note that in order to guarantee nonnegative values for the variance, we model its
logarithm. Then, its value can be obtained via the exponent operation.

• For the conditional PDF, if it is also Gaussian, a similar model as the one above is used, where the
roles of $z$ and $x$ are swapped and $\theta$ comprises the parameters of the associated decoder network.
If, on the other hand, the observations are of a discrete nature, a different model should be used.
For example, for binary images, a Bernoulli distribution can be employed to implement the decoder
network (e.g., [120]).
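The single-hidden-layer encoder described in the model above can be sketched as follows. The weights are randomly initialized for illustration, the dimensions are arbitrary, and the function name `vae_encoder` is hypothetical; the point is only to show how the three equations produce a mean vector and a vector of positive variances.

```python
import numpy as np

def vae_encoder(x, W1, b1, W2, b2, W3, b3):
    """Return the mean and the variances of the Gaussian q(z|x; phi)."""
    h = np.tanh(W1 @ x + b1)
    mu = W2 @ h + b2
    log_var = W3 @ h + b3            # the network models the logarithm of the variance
    return mu, np.exp(log_var)       # exponentiation yields positive variances

rng = np.random.default_rng(0)
l, n_hidden, m = 6, 4, 2             # input, hidden, and latent dimensions (illustrative)
W1, b1 = rng.normal(size=(n_hidden, l)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(m, n_hidden)), np.zeros(m)
W3, b3 = rng.normal(size=(m, n_hidden)), np.zeros(m)

x_n = rng.normal(size=l)
mu_n, var_n = vae_encoder(x_n, W1, b1, W2, b2, W3, b3)
```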
FIGURE 18.45

The encoder network outputs the mean values and variances of the multivariate Gaussian posterior that describes
the latent variables, which correspond to the current input. The decoder network is excited by samples of the latent
variables, drawn according to their posterior estimate, to generate samples similar to the original input.

Fig. 18.45 illustrates the VAE architecture. For each observation $x_n$, the encoder’s network outputs
the corresponding mean values and variances, which define the specific posterior multivariate Gaussian
PDF of the respective latent variables, $z_n$. Sampling from the posterior PDF, one can generate samples,
according to the conditional, at the output of the decoder network that match the original input.

The cost function: Following the arguments in Chapter 13, ideally one should maximize the likelihood
of the observations with respect to the unknown parameters. However, the computation of the likelihood
is not tractable for this case. Instead, we will maximize a related lower bound of it, i.e., Eqs. (13.1)
and (13.2). Assuming the observations to be i.i.d. and adjusting to the current notational context, this
lower bound, $\mathcal{F}$, becomes

$\ln p(x_1, x_2, \ldots, x_N; \theta) = \sum_{n=1}^{N} \ln p(x_n; \theta) \ge \sum_{n=1}^{N} \mathcal{F}(\theta, \phi; x_n)$,

where

$\mathcal{F}(\theta, \phi; x_n) := \mathbb{E}_q\left[ \ln p(x_n, z; \theta) - \ln q(z|x_n; \phi) \right]$.

After applying the Bayes theorem on the joint distribution term and using the definition of the KL
divergence, we get

$\mathcal{F}(\theta, \phi; x_n) = -\mathrm{KL}\left( q(z|x_n; \phi) \,\|\, p(z; \theta) \right) + \mathbb{E}_q\left[ \ln p(x_n|z; \theta) \right]$.

Once the parametric forms of the involved distributions have been adopted, all one has to do is to
maximize $\mathcal{F}$ with respect to the unknown parameters. Taking into account that the adopted models are
feed-forward neural networks, the necessary gradients for the optimization task are computed via the
backpropagation rationale. To this end, several “tricks” are in order, and the details can be found in,
e.g., [120].

The lower bound involves two terms. The second term on the right-hand side is the expected value
over the latent variables. To maximize the bound, this term should be maximized for the respective
observations, $x_n$; this encourages the decoder to learn to reconstruct the data. The first term is nonpositive
(the KL divergence is a nonnegative quantity). To maximize the bound, the KL should be minimized;
thus, the corresponding posterior will remain close to the prior; the latter has been chosen to be the standard
multivariate Gaussian, $\mathcal{N}(0, I)$. Hence, the prior acts as a regularizer that “pushes” the network’s
latent variables to match the standard Gaussian distribution as closely as possible. The existence of this
term is very important. The regularizer “forces” the PDF of the latent variables to be broad enough.
We do not want, for example, two different images from the same class to be mapped to two different
regions of the latent space; i.e., we do not want their corresponding PDFs to be very narrow, with small
variance, placed at mean values that are far away. Our desire is to map/code every image of the same
class via latent variables that lie “close” enough in the same region of space.
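For a Gaussian posterior approximation with diagonal covariance matrix and a standard Gaussian prior, the KL term is available in closed form, $\mathrm{KL} = \frac{1}{2} \sum_{i=1}^{m} \left( \sigma_i^2 + \mu_i^2 - 1 - \ln \sigma_i^2 \right)$. The sketch below evaluates this expression to illustrate the regularizing effect; the function name and the numerical values are illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

# A posterior that coincides with the prior incurs no penalty.
kl_prior = kl_to_standard_normal(np.zeros(2), np.ones(2))
# A narrow posterior centered far from the origin is penalized heavily.
kl_narrow = kl_to_standard_normal(np.array([3.0, 3.0]), np.array([0.01, 0.01]))
# Shifting a single mean by one unit costs 1/2.
kl_shift = kl_to_standard_normal(np.array([1.0, 0.0]), np.ones(2))
```

The large value of `kl_narrow` reflects the point made above: narrow posteriors placed at distant mean values are discouraged by the first term of the bound.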

The reparameterization trick : Before, it was mentioned that in order to apply the backpropagation 
algorithm some “tricks” are required. A major problem in the current setting, when dealing with the 
backpropagation for gradient computations, is that the decoder is excited by random samples. However, 
the dependence of the random samples on the unknown parameters is not explicit, e.g., via a function 
dependence. Their dependence is implicit via the respective PDF. In other words, if we draw samples 
from, say, a Gaussian, the sample values are not explicitly expressed in terms of the respective mean 
value and variance, although, of course, they depend on them. Undoubtedly, it is counterintuitive to 
try to compute derivatives of a random variable with respect to parameters that define the associated 
distribution. 

The previous difficulty is bypassed by the so-called reparameterization trick. The idea is to view the 
latent variables as being deterministic and express them as a function in terms of the parameters, which 
define the respective distribution, and of an independent auxiliary random variable that takes care of 
the randomness. For example, in the case of Gaussians, these parameters are the mean values and 
variances (covariance matrix). The auxiliary random variable can follow a simple PDF (e.g., Gaussian 
or uniform). It suffices for the function to be differentiable. In our context, the implied transformation 
is a linear one, and each component of the latent variables is expressed as 

$z_i = \mu_i + \sigma_i \epsilon, \quad i = 1, 2, \ldots, m,$

where $\epsilon$ is the auxiliary random variable, with zero mean and unit variance. This transformation makes
it possible to compute derivatives with respect to the parameters of the distribution, while still
maintaining the ability to randomly sample from that distribution.
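
As a minimal sketch of the reparameterization trick, the following Python fragment expresses each latent component as a deterministic function of a mean, a standard deviation, and an independent auxiliary sample. The function names, the use of one auxiliary Gaussian sample per component, and the numerical values are assumptions made for illustration only; they are not taken from the text.

```python
import random


def reparameterize(mu, sigma, eps):
    """Return z_i = mu_i + sigma_i * eps_i for each latent component."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def sample_latent(mu, sigma, rng):
    """Draw the auxiliary noise first, then map it deterministically to z."""
    # All randomness lives in eps ~ N(0, 1); mu and sigma enter as plain
    # function arguments, so z is differentiable with respect to them.
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return reparameterize(mu, sigma, eps), eps


rng = random.Random(0)                 # fixed seed, for repeatability only
mu, sigma = [1.0, -2.0], [0.5, 2.0]    # illustrative parameter values
z, eps = sample_latent(mu, sigma, rng)

# For a fixed eps: dz_i/dmu_i = 1 and dz_i/dsigma_i = eps_i.
```

For a fixed draw of the auxiliary variable, the derivative of each component with respect to its mean is 1 and with respect to its standard deviation is the auxiliary sample itself, which is what allows the gradients to flow back to the encoder.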

Although VAEs are easier to train compared to GANs, experimental evidence indicates that when 
VAEs are used for generating images, the obtained images tend to be more blurred compared to the ones 
that are produced via GANs. In [234], the Wasserstein distance is used to replace the KL divergence 
in the cost function for training VAEs and it is reported that this has a beneficial effect on the obtained 
performance. Another line of research that is currently pursued is along the convergence of GANs and 
VAEs, in an effort to combine the benefits of both (see, e.g., [155,159]). 

At the time the current edition of the book is being compiled, this is an ongoing active research 
field. 


18.16 CAPSULE NETWORKS 

The convolutional neural networks, treated in Section 18.12, and their many variants are currently state 
of the art in a diverse range of applications. Yet, in spite of their successes, CNNs are not beyond 
shortcomings. The need for large data sets for their training is a major one. One of the reasons for 
the requirement of large training sets is that it is not easy for CNNs to learn and extrapolate geometric
relationships to new viewpoints. For example, at the end of Section 18.12.1, it was pointed out
that pooling introduces invariance to small translations. However, in order to deal with more general 
invariances, such as scaling and rotation, one has to train the network on different viewpoints of the 
same object, which necessarily leads to an increase of the required size of the data set. Furthermore, 
the subsampling associated with the pooling “throws away” information with respect to the position of 
an entity within an area in the input image; this can affect information concerning the precise spatial 
relationships that are critical for recognition of objects. For example, the spatial relationships among 
the eyes, the mouth, and the nose are important in recognizing a face. To this end, a number of variants 
have been suggested in order to embed into a network affine transformation stages, which are learned 
during training, in an attempt to robustify it and boost its transformation invariance properties (see, 
e.g., [107] and the references therein). 

In [202], an alternative concept was introduced as a way to overcome the previous drawbacks. 
The proposed architecture builds upon the notion of convolutions as in CNNs, but it further embeds 
a new type of layers based on the concept of capsules. The term “capsule” refers to a group of scalar 
activations that are collectively combined to form an activity vector. Fig. 18.46, inspired by [202], is an 
example of such an architecture. To grasp the main rationale behind the capsule networks, let us follow
step-by-step the various blocks in the figure.

FIGURE 18.46

The input is a 28 x 28 image. Using 9 x 9 convolutions with stride s = 1 and 256 channels, the produced volume in
the first layer is of size 20 x 20 x 256. The convolutions performed for the second layer are 9 x 9 with stride s = 2.
They are combined together in groups of 8, to form 32 volumes of size 6 x 6 x 8.

• Input : The inputs to the network are 28 x 28 images from the MNIST database. 

• First convolutional layer: This is a standard convolutional layer. The filter matrix (kernel) is 9 x 9,
the stride is s = 1, and the depth (number of channels) is d = 256. Thus, the size of the output
volume of the first hidden layer is k x k x 256, where $k = \lfloor \frac{28 - 9}{1} + 1 \rfloor = 20$.

• Primary capsules: This is also a convolutional layer. The difference with the previous one is that in 
this layer a grouping of activities takes place. The filter matrix for the convolutions is also 9 x 9,
the stride is s = 2, and the depth is d = 256. However, the output channels are clustered together in
groups of 8. Thus, 32 volumes are formed, each of depth 8 (total 256) (see Fig. 18.46). Each one of
the volumes has size k x k x 8, where $k = \lfloor \frac{20 - 9}{2} + 1 \rfloor = 6$.

FIGURE 18.47

For each position/pixel (i, j) of the 6 x 6 frontal image grid, a capsule is formed of dimension 8. It comprises the
eight elements of the depth of the volume at the respective grid position, i.e., (i, j, k), k = 1, 2, ..., 8.


Fig. 18.47 illustrates how the capsules are formed. For the example of the figure, each capsule is of 
dimension D = 8. For each one of the 32 volumes, 36 (6 x 6) eight-dimensional tubes/vectors are 
formed. In other words, each capsule corresponds to a specific location in the feature map arrays. 
Hence, each one of the eight dimensions in a capsule encapsulates activity within a corresponding
receptive field in the volume of the previous layer (see Fig. 18.46). Thus, a total of 36 x 32 = 1152
capsules are obtained, each one of dimension equal to 8. These vectors comprise the outputs of
the second layer. Let us denote each one of the produced capsules as $\boldsymbol{u}_i$, $i = 1, 2, \ldots, I$, where
$I = 1152$.

• Digit capsules: This layer comprises three stages, which are illustrated in Fig. 18.48. 

- Stage 1: In this stage, affine transformations are performed on each one of the input (to this 
layer) capsules, i.e., 


$\boldsymbol{u}_{j|i} = W_{ij} \boldsymbol{u}_i, \quad i = 1, 2, \ldots, I, \quad j = 0, 1, \ldots, 9,$

and $W_{ij} \in \mathbb{R}^{16 \times 8}$. In words, each one of the input capsules is transformed to ten 16-dimensional
vectors. The number “ten” corresponds to the number of the classes, one per digit. Hence, the
total number of performed transformations is $10I$. The $W_{ij}$ matrices are learned during training
and their role is to establish spatial and other types of relationships between the lower-level
features (as they are encoded in the input capsules) and the higher-level ones. Simply stated, the
input information is “optimally” transformed to “match” the higher-level information, which
in our case are the ten classes. The $W_{ij}$ matrices could be seen as the model to establish a
“part-to-whole” relationship; the classes can be viewed as the “whole” and the input capsules as 
the “parts,” each encoding individual features, e.g., type of a stroke, its thickness, and its width, 
which form the input images; however, these images may have been, for example, scaled and/or 
rotated and the role of these transformations is to take care of such variants. 

- Stage 2: The obtained transformed capsules, for each one of the classes, are then combined
together via a weighted sum using coupling coefficients to form ten (one per class) vectors/capsules,

$\boldsymbol{s}_j = \sum_{i=1}^{I} c_{ij} \boldsymbol{u}_{j|i}, \quad j = 0, 1, \ldots, 9.$ (18.88)

As we will see in stage 3, it is these capsules that go through the nonlinearity to produce the
outputs that finally define the winning class. Looking at Eq. (18.88) more carefully, the coupling
coefficients $c_{ij}$ weigh the importance of each one of the involved transformed capsules while
computing the corresponding $\boldsymbol{s}_j$. The nonnegative coupling coefficients have a probability
interpretation and they add to one summing over j, for each fixed value of i. As will become apparent
soon, their computation comprises a critical part of the capsule networks.

FIGURE 18.48

This layer consists of three stages. In the first one, ten (one per class) affine transformations are performed on each
one of the input capsules, which were formed in the previous layer. In the next stage, a weighted sum is computed,
over the transformed capsules for each one of the ten classes. At the final stage, the obtained vectors are “pushed”
through a squashing nonlinearity to generate the ten output capsules. The input capsules are of dimension 8 and the
output ones of dimension 16, using transform matrices, $W_{ij}$, of appropriate dimensions.

- Stage 3: Once the $\boldsymbol{s}_j$s have been computed, they are “pushed” through the nonlinearity to provide
the output capsules, $\boldsymbol{v}_j$, $j = 0, 1, \ldots, 9$. However, the nonlinearity has been carefully chosen, so
that the lengths (Euclidean norms) of the output capsules are indicative of the class of the input
image. For example, if the input digit is, say, “4,” the network is trained so that the length of the
respective capsule, $\boldsymbol{v}_4$, is the largest one compared to the other nine competitors. The idea is that
each dimension (16 in total) of the output capsules represents different activities that are present
in the input images. Hence, for the capsule that corresponds to the correct class, it is expected
that all dimensions “exhibit” high activity.

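The adopted nonlinearity is

$\boldsymbol{v}_j = \frac{\|\boldsymbol{s}_j\|^2}{1 + \|\boldsymbol{s}_j\|^2} \frac{\boldsymbol{s}_j}{\|\boldsymbol{s}_j\|}, \quad j = 0, 1, 2, \ldots, 9.$ (18.89)

Observe that the nonlinearity affects the length but not the direction of the vector. For small
values of $\|\boldsymbol{s}_j\|$, $\|\boldsymbol{v}_j\| \rightarrow 0$. For large values of $\|\boldsymbol{s}_j\|$, $\|\boldsymbol{v}_j\| \rightarrow 1$.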

Note that here lies a major difference of the capsule networks with the more “traditional” neu- 
ral networks. Both the input and the output of the nonlinearity are vectors, instead of scalars. 
Activations are coded not only in terms of magnitude but also in terms of directions. 
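
The layer sizes and the squashing nonlinearity described in the blocks above can be checked numerically. The following Python fragment is only a sketch under stated assumptions: the helper names `conv_output_size`, `squash`, and `length` are hypothetical, and the convolutions are assumed to use no padding, which is what the sizes quoted in the text imply.

```python
import math


def conv_output_size(n, kernel, stride):
    """Output side length of an unpadded convolution: floor((n - kernel) / stride + 1)."""
    return (n - kernel) // stride + 1


def squash(s):
    """Eq. (18.89): rescale s to length ||s||^2 / (1 + ||s||^2), keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]          # the zero vector is mapped to itself
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]


def length(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


k1 = conv_output_size(28, 9, 1)    # first convolutional layer
k2 = conv_output_size(k1, 9, 2)    # primary capsules layer
num_capsules = k2 * k2 * 32        # one capsule per grid position in each of the 32 volumes

short = length(squash([1e-3, 0.0]))   # small input length: output length close to 0
long_ = length(squash([1e3, 0.0]))    # large input length: output length close to 1
```

With these definitions, `k1`, `k2`, and `num_capsules` reproduce the values 20, 6, and 1152 given in the text, and the two limiting cases illustrate why the length of an output capsule can be read as a score between 0 and 1.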


TRAINING 

The training phase of the capsule networks revolves around the following interrelated concepts.

The loss function bears a close similarity to the hinge loss function that has been used in the context
of support vector machines (Section 11.10), and it is defined as

$\mathcal{L} = \sum_{j=0}^{9} \left\{ T_j \max\{0, m^+ - \|\boldsymbol{v}_j\|^2\} + \lambda (1 - T_j) \max\{0, \|\boldsymbol{v}_j\|^2 - m^-\} \right\}.$ (18.90)

The loss function comprises two terms. During training, if the input digit corresponds to, say, the
kth capsule, then $T_k = 1$ and $T_j = 0$ for $j \neq k$. Thus, for minimum loss value, the length of the correct class
capsule k will be adjusted towards values larger than a threshold, $m^+$, whose value in practice is set
equal to 0.9. In contrast, the lengths of the rest of the capsules should be “pushed” towards lengths
smaller than $m^-$, which in practice is set equal to 0.1. The parameter $\lambda$ is used to downweigh the
contribution of the second term and in [202] it is suggested to be set equal to 1/2. The loss value is
used to adjust the values of the $W_{ij}$ matrices and the coefficients of the filters used in the convolutions.
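
The two terms of the loss can be illustrated with a few lines of Python. This is a sketch only: the function name `margin_loss` and its argument names are hypothetical, and it assumes that the capsule norms enter Eq. (18.90) squared and are compared directly against the two thresholds. If the norms are meant to enter unsquared, the same code applies with the plain lengths passed in instead.

```python
def margin_loss(norms_sq, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over the classes of the two hinge-like terms of Eq. (18.90).

    norms_sq : list with one value ||v_j||^2 per output capsule
    target   : index k of the correct class, so that T_k = 1 and T_j = 0 otherwise
    """
    total = 0.0
    for j, n2 in enumerate(norms_sq):
        t = 1.0 if j == target else 0.0
        present = t * max(0.0, m_plus - n2)                 # correct capsule too short
        absent = lam * (1.0 - t) * max(0.0, n2 - m_minus)   # wrong capsule too long
        total += present + absent
    return total


# Correct capsule above m_plus, all others below m_minus: no penalty.
good = [0.05] * 10
good[4] = 0.95
loss_good = margin_loss(good, target=4)

# All capsules equally active: both terms contribute.
loss_bad = margin_loss([0.5] * 10, target=4)
```

In the second case the correct class contributes 0.9 - 0.5 = 0.4 and each of the nine remaining classes contributes 0.5 x (0.5 - 0.1) = 0.2, for a total of 2.2.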

The iterative dynamic routing algorithm refers to the update of the coupling coefficients. These
coefficients have been “dressed” up with a probability interpretation, which is guaranteed by the following
softmax operation:

$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k=0}^{9} \exp(b_{ik})}, \quad i = 1, 2, \ldots, I, \quad j = 0, 1, \ldots, 9.$

In words, $c_{ij}$ represents the probability that the ith capsule of the previous layer is coupled to the jth
output (class) capsule of the higher layer. The computation of the coupling coefficients takes place
iteratively, according to Algorithm 18.7.


Algorithm 18.7 (Dynamic routing algorithm).

• Initialize $b_{ij} = 0$, $i = 1, 2, \ldots, I$, $j = 0, 1, 2, \ldots, 9$.

• For $r = 1, 2, \ldots, R$, Do; In practice, $R = 3$ iterations suffice.

- For $j = 0, 1, 2, \ldots, 9$, Do

* For $i = 1, 2, \ldots, I$, Do

$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k=0}^{9} \exp(b_{ik})}$

* End For

* $\boldsymbol{s}_j = \sum_{i=1}^{I} c_{ij} \boldsymbol{u}_{j|i}$

* $\boldsymbol{v}_j = \frac{\|\boldsymbol{s}_j\|^2}{1 + \|\boldsymbol{s}_j\|^2} \frac{\boldsymbol{s}_j}{\|\boldsymbol{s}_j\|}$

* For $i = 1, 2, \ldots, I$, Do

$b_{ij} \leftarrow b_{ij} + \boldsymbol{v}_j^T \boldsymbol{u}_{j|i}$

* End For

- End For

• End For

• Return $\boldsymbol{v}_j$, $j = 0, 1, \ldots, 9$.

A major feature in the algorithm is that the output capsules, $\boldsymbol{v}_j$, are produced by iteratively adjusting
their similarity or agreement with the transformed outputs/capsules of the previous level. This is
achieved via the updates of the $b_{ij}$s, which are done in a way that “strengthens” or “weakens” their
values according to the similarity (inner product) between $\boldsymbol{v}_j$ and $\boldsymbol{u}_{j|i}$. Note that every time iterations
start by setting $b_{ij} = 0$, i.e., they start from equiprobable values for the coupling coefficients. That is,
each i has equal probability with respect to all classes. Then, their values are updated according to the
matching between the computed outputs and the respective transformed input capsules. In words, the
coupling coefficients route the information from the lower layer to the higher one. A large $c_{ij}$ value
means a stronger connection of the ith capsule of the lower level to the jth capsule of the layer above.
It is important to emphasize that the values of the coupling coefficients are not adjusted by the
backpropagation that tries to minimize the cost function. The cost function, via the backpropagation of the
gradients, is used to adjust the elements of the $W_{ij}$ matrices as well as the filters of the convolutions.
The dynamic routing algorithm is the algorithm that produces the outputs. This is done both during
training and during testing, given the values of the matrices $W_{ij}$. As indicated in Algorithm 18.7, it
seems that three iterations suffice.
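
To make the routing mechanism concrete, the following Python sketch follows the loop order of Algorithm 18.7 on a toy problem with three input capsules, two output capsules, and two-dimensional transformed capsules. All names (`dynamic_routing`, `squash`, `softmax`, `norm`) and the toy data are assumptions introduced for illustration; subtracting the maximum inside the softmax is a standard numerical safeguard and is not part of the algorithm as stated.

```python
import math


def squash(s):
    """Squashing nonlinearity of Eq. (18.89)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = math.sqrt(norm_sq) / (1.0 + norm_sq)   # (||s||^2 / (1 + ||s||^2)) / ||s||
    return [scale * x for x in s]


def softmax(row):
    """Softmax over one row of logits (maximum subtracted for stability)."""
    top = max(row)
    exps = [math.exp(b - top) for b in row]
    total = sum(exps)
    return [e / total for e in exps]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dynamic_routing(u, num_iters=3):
    """u[i][j] holds the transformed capsule u_{j|i}; returns (v, c)."""
    num_in, num_out, dim = len(u), len(u[0]), len(u[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]    # logits b_ij, initialized to zero
    v = [[0.0] * dim for _ in range(num_out)]
    for _ in range(num_iters):
        for j in range(num_out):
            # Coupling coefficients c_ij for this j, one per input capsule i.
            c_j = [softmax(b[i])[j] for i in range(num_in)]
            # Weighted sum, Eq. (18.88), followed by squashing, Eq. (18.89).
            s_j = [sum(c_j[i] * u[i][j][d] for i in range(num_in)) for d in range(dim)]
            v[j] = squash(s_j)
            # Agreement update: b_ij <- b_ij + v_j^T u_{j|i}.
            for i in range(num_in):
                b[i][j] += sum(vd * ud for vd, ud in zip(v[j], u[i][j]))
    return v, [softmax(row) for row in b]


# Toy data: all inputs agree on output capsule 0 and disagree on output capsule 1.
u = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = dynamic_routing(u)
```

On this toy input the capsule on which the inputs agree ends up with the larger length, and the coupling coefficients of every input capsule shift towards it, which is the “routing by agreement” behavior described in the text. No backpropagation is involved in these updates.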

In [202], a regularizer via reconstruction is also used combined with the loss in Eq. (18.90), to
form the cost over the training data. During training, the output capsule $\boldsymbol{v}_k$, associated with the correct
class k, is provided as input to a fully connected network, which acts as a decoder to reconstruct the
corresponding input. The loss function to train the fully connected network is the sum of squared
differences between the input image pixels and the outputs, using sigmoid output nonlinearities. That
is, the loss function that is used for training the capsule network becomes

loss = $\mathcal{L}$ + $\alpha$ x reconstruction error,

where $\mathcal{L}$ is given in Eq. (18.90) and $\alpha = 0.005$.

Experiments in [202] with the MNIST database for digit recognition verify that different dimen- 
sions (16 in total) of the output capsules are associated with different features of the input digits, e.g., 
thickness, skew, and width. In an extension of the original work in [93], EM-type arguments are em- 
ployed for the dynamic routing algorithm. 

It should be emphasized that capsule networks are still an evolving methodology and it is too early
to make clear statements concerning their performance. Although for small data sets, such as
MNIST, capsule networks have provided state-of-the-art results, their performance on larger data sets
is still to be shown.


18.17 DEEP NEURAL NETWORKS: SOME FINAL REMARKS

The topic of deep architectures has been one of intense research and of continuously increasing in- 
terest since the early years of this millennium. It would be impossible to cover in a single chapter all 
techniques, all algorithmic variants, and the related literature that have appeared, and in particular in 
the framework of specific applications. In the previous sections, major directions and algorithms were 
reviewed that constitute the spine around which a plethora of variants have been developed. Also, it is 
the author’s view that the previously discussed techniques and models constitute the basic knowledge 
that one has to grasp in a first reading to be able to proceed further. Some related tutorial papers are 
[40,90,133,205]. 

Below, some further concepts are highlighted, which are of high interest at the time the current 
edition of this book is compiled, and seem to offer ways to improve upon the performance of the basic 
notions and architectures that have been discussed before. 

TRANSFER LEARNING 

Transfer learning is not a new idea; it has been around for many years. There are various approaches
and concepts that have been developed and applied in different applications (see, for example,
[173,242,249]). Our aim here is to present the main concept and not to delve into a more detailed pre- 
sentation of the various techniques that have been proposed over the years. 

As has already been stated in a number of points throughout this chapter, deep neural networks 
currently represent the state of the art in machine learning. Deep networks have been applied in almost
every scientific discipline and in many cases offer human or even superhuman performance (see, e.g.,
[114,197,229] for such example applications). 

However, training of such networks requires huge amounts of labeled data. Collecting and anno- 
tating data to build up big data sets is a rather painstaking task. Sometimes, large data sets do exist 
and are open and public, such as the ImageNet one [39], which contains 1.2 million images with 1000 
categories. However, often, they are proprietary or expensive to obtain. For example, this is the case for 
a number of speech-related data sets. Also, in a number of applications, obtaining big data sets is not 
possible, as for example in a number of medical-related applications, e.g., X-ray imaging and fMRI, 
one reason being that it is not easy to obtain large numbers of images corresponding to various types 
of a disease, e.g., malignant (cancerous) tumors, because a minority of people suffer from it. 

In contrast to this demand for huge training data sets, the human brain does not need to be trained 
from scratch every time it is faced with a new situation. Humans have an inherent mechanism to transfer 
knowledge and experience acquired from one situation (task) to another one that is sufficiently similar. 
Such mechanisms mainly operate at a subconscious level. For example, having learned how to ride a bike,
the related knowledge and experience can be transferred to learning how to ride a motorbike or drive a car.

Transfer learning in machine learning is the methodology of bypassing the rationale of the isolated- 
task learning paradigm. Instead, the goal is to develop techniques so as to utilize knowledge learned 
from one task to another related one. 

To demonstrate the concept, let us focus on an example in the image recognition context. Assume 
that one is interested in training a network for automatic recognition and characterization of tumors in 
X-ray imaging. As said before, it is difficult to acquire a large amount of data for training. However, 
images share a number of features irrespective of their specific type. In other words, an image with 
a dog, or a car, or a tumor is formed by a combination of edges, primitive shapes, changes in light 
intensity, etc. As a matter of fact, as we have already commented at the end of Section 18.12.3, the
purpose of the various layers in a CNN is to learn representations related to this type of information
and code it via the features that are formed hierarchically, layer after layer. Recalling what we have 
said in our discussion of CNNs, the convolutional layers learn the features. The features of the last 
convolutional layer are presented as input to the fully connected part of the network, whose output 
provides the associated prediction. 

In the context of our example, the idea of transfer learning is to “borrow” the architecture and the 
associated parameters of the convolutional layers, which have been previously learned by training via
a large image database, e.g., ImageNet. Then, fix these parameters and train the network only for the parameters
comprising the fully connected part of the network (see Fig. 18.29), using the available X-ray
images. In other words, we use the same features produced by training the network with images of a 
different kind. Then, these features are used to train the fully connected network using a task-specific 
database. Hence, we transfer knowledge previously acquired in an image-related task to facilitate the 
learning of a similar task. The first task, on which the network is trained, is usually referred to as the 
source task, and the second one, to which knowledge is transferred, as the target task. 

Another scenario is to use the previously learned parameters via the source task as initial values, 
instead of fixing them, and retrain the whole network using the X-ray images of the target task. Various 
scenarios and combinations can be devised, depending on the application and the data size of the target 
task. 

In summary, the main idea is this. Train a (or use a pretrained) deep network employing data acquired
for a source task, for which the available data set is large enough for the respective training. Fix
the parameters of the lower layers (or use them as initial values) and train for the parameters of the 
higher layers using the smaller-size data set of the target task (see, e.g., [189]). If the data size for the 
specific target application task is small, a simple single output layer can be used, e.g., a softmax or a 
linear SVM. 
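
The “fix the lower layers, train the output layer” recipe can be sketched in a few lines of Python. Everything in this fragment is a hypothetical stand-in chosen to keep the example self-contained: `PRETRAINED_W` plays the role of parameters learned on a source task, `frozen_features` the role of the convolutional layers, and a single logistic output unit trained by stochastic gradient descent the role of the task-specific part. It is not meant to represent any particular library or network.

```python
import math

# Hypothetical stand-in for layers pretrained on a large source-task data set.
PRETRAINED_W = [[1.0, -1.0], [-1.0, 1.0]]


def frozen_features(x, weights=PRETRAINED_W):
    """ReLU features from the frozen layers; the weights are never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def train_head(data, labels, lr=0.5, epochs=200):
    """Train only a logistic output layer on top of the frozen features."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            f = frozen_features(x)
            logit = sum(w * fi for w, fi in zip(head_w, f)) + head_b
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y                    # cross-entropy gradient w.r.t. the logit
            head_w = [w - lr * err * fi for w, fi in zip(head_w, f)]
            head_b -= lr * err
    return head_w, head_b


def predict(x, head_w, head_b):
    f = frozen_features(x)
    return 1 if sum(w * fi for w, fi in zip(head_w, f)) + head_b > 0 else 0


# Small target-task data set: class 1 when x[0] > x[1], class 0 otherwise.
target_x = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
target_y = [1, 1, 0, 0]
head_w, head_b = train_head(target_x, target_y)
predictions = [predict(x, head_w, head_b) for x in target_x]
```

The alternative scenario mentioned in the text, where the pretrained parameters are used only as initial values, would correspond to also updating `PRETRAINED_W` inside the training loop; that variant is not shown here.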

MULTITASK LEARNING 

The information sharing between tasks in transfer learning takes place sequentially. One, first, learns 
the parameters via training on the source task; then, the obtained set of parameters is used for learning 
the target task, via its associated set of data. 

In the so-called multitask learning, a number of different tasks are learned simultaneously. Multitask 
learning can be considered as a type of inductive learning, where knowledge is transferred by embedding
bias into the learning task. In a way, this is similar in concept to using a regularizer, whose
purpose is to bias the solution towards the constraint that is imposed by the regularizer. In the context 
of multitask learning, the model for each one of the involved tasks is trained so as to bias its solution 
towards the rest of the tasks. It turns out that training networks in the multitask framework robustifies 
the network model against overfitting, when compared to model training on each task separately. 

As was the case with transfer learning, the idea of multitask learning is not new; it has been
around for a number of years in the context of different learners (see, e.g., [4,28]). Among the large
number of techniques that have been proposed, two paths will be highlighted here that are more popular
in the framework of deep networks (see, e.g., [198] and [257] for more recent reviews).

In the so-called hard parameter sharing approach, a number of hidden layers are shared by all
tasks. At the same time, a number of (output) layers are allowed to be task-specific. This is schematically
shown in Fig. 18.49. The parameters of all the layers, both shared and task-specific, are learned
simultaneously. In such a training concept, the overfitting tendency is reduced. This is natural, because
learning more than one task simultaneously makes the obtained representation able to capture
information from all tasks, and hence the tendency for overfitting to any specific one is reduced. An
application of this type of multitasking is, for example, in computer vision, where the goal of each
task could be to predict the label of a different object. The setting is also very convenient when more
than one object is present in an image and the task is to detect the objects’ class and their bounding
boxes.

FIGURE 18.49

In the hard parameter sharing multitask learning, a number of hidden layers are shared among the various tasks,
while some higher-level layers (e.g., the output layer) are task-specific.

Another path is the so-called soft parameter sharing multitask approach, which is illustrated in 
Fig. 18.50. According to this rationale, the models do not share parameters, yet their parameters are 
constrained to be similar. Similarity is measured according to some distance norm. For example, parameters
are constrained so that their distance according to the Euclidean $\ell_2$ norm is smaller than a
threshold. Other norms can be and have been used, too. 
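
One simple way to encourage such similarity in practice is to add a penalty on the pairwise distances to the training cost, instead of enforcing a hard threshold. The Python sketch below computes such a penalty; the function name `soft_sharing_penalty`, the weight `lam`, and the choice of a squared Euclidean distance summed over all pairs of tasks are assumptions made for illustration and are not prescribed by the text.

```python
def soft_sharing_penalty(task_params, lam=0.1):
    """lam times the sum, over all pairs of tasks, of the squared Euclidean
    distance between their parameter vectors (given as lists of floats)."""
    penalty = 0.0
    for a in range(len(task_params)):
        for b in range(a + 1, len(task_params)):
            penalty += sum((p - q) ** 2 for p, q in zip(task_params[a], task_params[b]))
    return lam * penalty


# Identical parameters incur no penalty; distant ones are penalized.
same = soft_sharing_penalty([[1.0, 2.0], [1.0, 2.0]])
apart = soft_sharing_penalty([[0.0, 0.0], [3.0, 4.0]], lam=1.0)
```

Such a term would be added to the sum of the task-specific losses, so that each model is free to fit its own task while being pulled towards the parameters of the others.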

FIGURE 18.50

In the soft parameter sharing multitask learning, the constrained hidden layers exchange parameters; during training,
care is taken so that the parameters of the layers for the different tasks, at the same level, are similar in some sense;
for example, their $\ell_2$ distance is small.

GEOMETRIC DEEP LEARNING

The setting of our discussion in this chapter referred to data that reside in Euclidean spaces. Indeed,
all the examples we referred to were images, videos, and time series signals/sequential data. Also,
the basic operation in CNNs was convolution, which was defined as an operation on images and time
series. However, in many scientific fields the underlying structure is non-Euclidean. Such examples
include information that resides in social networks, sensor networks in communications, and brain
networks in fMRI, to name but a few. Dealing with such types of data, we no longer have the regularity
that underlies sequential data or images. The definition of convolution has to be modified and extended
to include information which is related to the geometry of the non-Euclidean domains, such as graphs
and manifolds. The treatment of such data goes beyond the scope of this chapter. The interested reader
can consult, for example, the tutorial given in [23] and the references therein.

OPEN PROBLEMS 

Undoubtedly, deep neural networks as a technology have revolutionized the discipline of machine 
learning. Yet, as we have commented in various parts of this chapter, they are not free of drawbacks. The 
reader should be aware of and alert to these for two reasons. The first is that drawbacks, necessarily, define
topics of research with a number of challenging and interesting problems. So, drawbacks also have their
positive side; they constitute opportunities. Science and technology are evolutionary processes and no
method or theory solves all the problems in one go. The second reason is that we should learn not to
“worship” this new tool and consider it as a panacea that provides solutions to all problems.

As has already been stated in other parts of this chapter, deep neural networks have achieved accu- 
racy performance that matches that obtained by humans and in some cases superhuman performance 
has been demonstrated; yet, these reported results come from carefully curated data sets. At the same 
time, deep neural networks can be fooled by adversarial examples, while no judicious human would. 
Furthermore, such impressive performances may not hold true when such models are applied in the 
“wild” real world and the model is challenged to deal with conditions quite different from those repre- 
sented in the training set. Also, the reported accuracies are obtained at significantly high computational 
cost, using energy consuming devices, such as GPUs, and huge amounts of data are required for the 
training. Such data sets are, sometimes, open but in a number of cases are private property. 

Deep neural networks suffer from a lack of interpretability of the obtained results. Although
advances in this direction have been reported (see, for example, [170] and the references therein), the
problem is far from solved. Any machine learning system should be designed in a way that the user
can follow the rationale of each prediction/decision, that is, to be able to answer questions of the type,
“why is this person diagnosed with cancer?”

Deep neural networks suffer from what is known as catastrophic forgetting [157]. When our human
brain learns, say, task A, it can generalize and learn a second one, B, without forgetting A. Existing deep 
networks tend to forget the previous task A when they learn the new one, B. This is a major drawback 
for tasks where sequential or incremental learning is of importance, for example, in applications such as 
gesture recognition, network traffic analysis, or face and object recognition in mobile robots, where the 
network has to make updates on site and in time. This is an ongoing research topic (see, for example, 
[182] and the references therein). 

Deep neural networks suffer from significant parameter redundancy (e.g., [42]). It has been reported 
that, in some cases, more than 95% of the parameters can be removed without sacrificing performance. 
This is the basis behind the pruning techniques that have been reported in this chapter. This is also a 
topic of ongoing research (see, e.g., [142,174] and the references therein). 

Deep neural networks do not scale well for low-power devices, such as those needed in mobile 
phones, robots, and autonomous cars. Having to make billions of predictions per day has substantial 
energy costs. As commented before, GPUs are energy consuming devices. Real-time predictions are 
often orders of magnitude away from what deep neural networks can deliver. Thus, compression and
efficiency comprise an important topic of interest in deep learning research (see, e.g., [125,142,174]
and the references therein). Federated learning, in the context of what is known as “AI at the edge,” is
another path in this direction (see, e.g., [20,178]).

In a nutshell, although one should be careful in making predictions for the future, I can predict 
that, in a few years, a new edition will be needed to cover advances related to the ongoing research 
“happening” in this exciting field. 


18.18 A CASE STUDY: NEURAL MACHINE TRANSLATION 

Neural machine translation (NMT) is a subtopic of the natural language processing (NLP) discipline. 
The latter is broadly defined as the scientific area whose goal is the automatic manipulation (process, 
analyze, and synthesize) of natural languages, such as speech and text, via the use of computers. In 
Section 11.15, a case study on the closely related topic of authorship identification was discussed.
There, a number of terms such as bag-of-words and n-grams were defined and used in the context of 
SVM classification. 

In this concluding section of the current chapter, our goal is not to introduce the reader to the
general and broad NLP field. After all, such a treatment could justify a dedicated book. Our aim is
more humble and we are only going to focus on some basic neural network models that can efficiently
be employed in the framework of automatic translation of phrases from one language to another. The
stage of our discussion is that of the RNNs, which have been introduced in Section 18.13. Like speech,
the more general language processing/modeling lends itself naturally to be treated via RNNs, since any
language is sequential in nature. Moreover, very often, the meaning of a word depends on the part of a
phrase in which it is “located.” More importantly, the respective meaning may be in “context”; that is, it often
depends on the other words that are present and comprise a specific phrase.

In the sequel, we are only going to focus on some basic models. Once the basics have been grasped,
one can search for more detailed and advanced techniques in the related literature. The source of
“inspiration” lies in the results summarized in some key papers in the area, such as [8,31,145,224] and
the references therein.

The starting point refers to the representation of each word in a computer. We will assume that each 
word corresponds to a single vector. For example, in the so-called one-hot representation , each word 
corresponds to a specific one-hot vector. The dimension $l$ of these vectors is equal to the number of
the words in the vocabulary that is used for the translation purposes. For example, 10000 words seems 
to be a reasonable size that suffices for many purposes and, of course, for everyday communication. 
By convention, each one of the, say, 10000 words is assigned a specific number/index between 1 and 
10000 and the corresponding one-hot vector, which represents it, has a 1 as its element in the respective 
position, while the rest of the elements are set equal to 0. There are other, more “fancy,” representations 
but this is out of our scope (see, e.g., [15]). It must be noted that the number of existing words in a 
language vocabulary can be much larger than the, say, 10000 that are used for practical purposes. For 
example, in the Greek language, being an ancient one, it is estimated that there are approximately 
100000 words and 300000 meanings. 

To formulate mathematically the automatic translation task, let us consider, as an example, the
French phrase “Je suis étudiant,” the English translation of which is “I am a student.” The phrase in
French consists of three words, each one represented by a one-hot vector, i.e., $x_1, x_2, x_3$, respectively.
The breaking of a phrase into individual words is often referred to as tokenization and the individual
words/vectors as tokens. Usually, the end of a phrase is indicated by a special symbol, ⟨EOS⟩, which
means “end of sequence,” and a special token can be reserved for it. Also, often, the beginning of a
phrase is denoted as ⟨SOS⟩, i.e., start of sequence, which is usually a vector of zeros. Finally, since
a word may not exist in the selected vocabulary, a token corresponding to an unknown word, e.g.,
⟨UNK⟩, may also be used. Thus, the input sequence is

$$\langle\mathrm{SOS}\rangle, x_1, x_2, x_3, \langle\mathrm{EOS}\rangle.$$


For the one-hot representation, each one of the vectors is of the following form:

$$x_n = [0, \ldots, 0, 1, 0, \ldots, 0]^T \in \mathbb{R}^l, \quad n = 1, 2, 3,$$

with the 1 being at the $l_n$th position, which is the corresponding index of the word in the respective
vocabulary, assumed to be of size $l$. In a similar way, the output sequence, which is the translated
phrase into English, is written as

$$\langle\mathrm{SOS}\rangle, y_1, y_2, y_3, y_4, \langle\mathrm{EOS}\rangle,$$

where the one-hot vectors of the output sequence indicate the corresponding index of the respective 
word in the English vocabulary. 
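The tokenization and one-hot encoding described above can be sketched in a few lines of Python. This is only an illustrative sketch: the tiny vocabulary, the whitespace tokenizer, and the function names `one_hot` and `encode` are assumptions made for the example, and indices start at 0 rather than 1, following Python conventions.

```python
# Hypothetical toy vocabulary (an assumption); a practical one would hold thousands of words.
VOCAB = ["<EOS>", "<UNK>", "je", "suis", "étudiant"]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def one_hot(index, size):
    """Return a vector with a 1 at position `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]


def encode(phrase):
    """Tokenize on whitespace and return <SOS>, x_1, ..., x_N, <EOS> as vectors."""
    size = len(VOCAB)
    sos = [0] * size  # <SOS> is represented by a vector of zeros, as in the text
    tokens = phrase.lower().split()
    ids = [INDEX.get(tok, INDEX["<UNK>"]) for tok in tokens]  # unknown words -> <UNK>
    return [sos] + [one_hot(i, size) for i in ids] + [one_hot(INDEX["<EOS>"], size)]


vectors = encode("Je suis étudiant")
for v in vectors:
    print(v)
```

Any word that is missing from the vocabulary is mapped to the ⟨UNK⟩ vector, so the length of the encoded sequence depends only on the number of tokens in the phrase.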

In practice, most often, the so-called embeddings are employed instead of the one-hot vectors.
That is, each one-hot vector is projected into a lower-dimensional space via its multiplication with an
embedding matrix, i.e., $\tilde{x}_n = E x_n$, where $E$ is of appropriate dimensions and its elements are learned
during training. Besides the reduction in the respective dimensions, which leads to a substantial reduction
of the associated parameters in the involved RNNs, such an embedding can also exploit further
semantic relations that exist among different words (see, e.g., [15]). To keep our discussion simple, let
us stick to the one-hot representation.
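As a small aside on the embedding step, multiplying an embedding matrix by a one-hot vector simply selects one column of that matrix, which is why embeddings are usually implemented as table lookups. The sketch below illustrates this with a hypothetical $2 \times 4$ matrix `E`; its values are made up for the example and are not learned.

```python
def matvec(M, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical embedding matrix: vocabulary size l = 4, embedding dimension 2.
E = [
    [0.1, 0.5, -0.3, 0.7],
    [0.2, -0.4, 0.9, 0.0],
]

x = [0, 0, 1, 0]                # one-hot vector of the word with (0-based) index 2
embedded = matvec(E, x)         # E x
column = [row[2] for row in E]  # direct lookup of column 2

print(embedded, column)
```

Both computations give the same two-dimensional vector; the lookup avoids the multiplications with zeros.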

The first observation one can readily make is that the input and output sequences are of different
sizes. The input sequence is of length $N = 3$ and the output one of length $N' = 4$. Thus, using a standard
RNN formulation, as illustrated in Fig. 18.34, is not appropriate, since, there, the input and output
sequences are assumed to be of the same length. Hence, a modification to the standard RNN formulation,
as this was treated in Section 18.13, is required. Besides the appropriate model architecture, the
second major goal is to adopt a suitable loss function that quantifies the fit and the “similarity” between
the input and the output sequences. In the current setting, the first criterion that comes to mind is the
conditional probability,

$$P(y_1, \ldots, y_{N'} | x_1, \ldots, x_N). \qquad (18.91)$$

That is, from all possible word combinations of size $N'$ in the English (for our case) dictionary, the
target sequence should be the one that maximizes the conditional probability, given the input (French)
sequence. We will now provide answers on how to deal with the previous issues and come up with
methods concerning (a) the design of appropriate architectures, (b) the related formulation for the
optimizing criterion, and (c) an inference algorithm.

Encoder-decoder design: Since the input and output sequences are of different size, a way out is to 
use two different RNNs. One will be dedicated to the encoding of the input sequence and the other to 



FIGURE 18.51 

The RNN encoder receives the input words sequentially and outputs the state vector at time instant $N$, after all the
words have been presented to it. The initial state vector, $h_0$, is set equal to zero. The summary vector, $c$, “summarizes”
the information that resides in the input sequence and it is subsequently used to excite the decoder.

the generation (decoding) of the output one. Recall that at the heart of an RNN lies the state vector,
which, as stated in Section 18.13, encodes the history of the input sequence. The RNN encoder is
shown in Fig. 18.51. It is basically the same as that in Fig. 18.34B, with one exception. There is no
output sequence. Furthermore, we have used boxes instead of circles. This is to stress that, often
in practice, the LSTM version is used as the basic architectural unit. So, starting from a, usually, zero
initial state vector, $h_0$, the output of the encoder is the so-called representation or summary or context
vector, $c$, associated with the input sequence. In the simplest model, $c$ is set equal to the state vector
at time instant $N$, i.e., $c = h_N$. Later on, we are going to discuss possible modifications of such a
choice. Also, in practice, each one of the boxes in Fig. 18.51 may correspond to a multilayer LSTM,
as discussed in Remarks 18.6.
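To make the role of the summary vector concrete, the following sketch runs a plain RNN encoder over a short sequence of one-hot vectors and returns $c = h_N$. It is a minimal illustration under stated assumptions: a simple tanh recursion is used in place of an LSTM, the matrices `U`, `W` and the bias `b` are random rather than trained, and all names and sizes are hypothetical.

```python
import math
import random


def rnn_encoder(inputs, U, W, b):
    """Return c = h_N for h_n = tanh(U x_n + W h_{n-1} + b), with h_0 = 0."""
    k = len(b)
    h = [0.0] * k  # zero initial state h_0
    for x in inputs:
        h = [
            math.tanh(
                sum(U[i][j] * x[j] for j in range(len(x)))
                + sum(W[i][j] * h[j] for j in range(k))
                + b[i]
            )
            for i in range(k)
        ]
    return h  # the state after the last input is the summary vector c


random.seed(0)
l, k = 5, 3  # hypothetical vocabulary size and state dimension
U = [[random.uniform(-0.5, 0.5) for _ in range(l)] for _ in range(k)]
W = [[random.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(k)]
b = [0.0] * k

# One-hot inputs for a three-word phrase (0-based word indices 2, 3, 4).
phrase = [[1 if i == idx else 0 for i in range(l)] for idx in (2, 3, 4)]
c = rnn_encoder(phrase, U, W, b)
print(c)
```

Whatever the length of the input phrase, the encoder output has the fixed dimension `k`, which is what allows the decoder to generate an output sequence of a different length.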

The RNN decoder is shown in Fig. 18.52. Looking at this figure, the following comments are in
order. First, the context vector $c$ is directly fed to each one of the RNN stages. There are variants to
this. For example, in [224], $c$ only feeds the initial state $h'_0$ and is not utilized in the subsequent stages.
That is, in this context, $c$ is only used to “excite” the RNN decoder, instead of the zero value that is
FIGURE 18.52

The RNN decoder uses the summary vector, $c$, which is provided by the encoder, and the respective inputs are
the (delayed) words/vectors of the target output sequence. The decoder is trained to act as a predictor of the target
output sequence. At each stage, it outputs a vector of softmax probability estimates for each one of the output
vocabulary words, conditioned on the input sequence as well as the previous target output vectors/words (or previous
decisions when it is operating in the inference phase). In the figure, we have used an output at time $N' + 1$, and the
respective target symbol during training is the ⟨EOS⟩ token.


used to initialize $h_0$ in the encoder part. Second, the output target values of the previous stage are used
as an input to the next one. This is natural. The goal of the decoder is to act as a predictor. Having
decided on the output word at stage $n-1$, the goal is to predict the next word, at stage $n$, in the output
sequence. The update of the state equation for each stage can be written as

$$h'_n = f(y_{n-1}, h'_{n-1}, c),$$

for some nonlinear function $f$, and the output is of the form

$$\hat{y}_n = g(h'_n). \qquad (18.92)$$

We have suppressed in the above equations the dependence on the involved parameters (e.g., multiplying
matrices, related bias vectors; Section 18.13) for notational simplification and we only focus
on the recursive dependence of the involved variables. Recall that the action of the nonlinear functions
is element-wise upon their vector arguments. In some of the previously referenced papers, an explicit
dependence on $c$ and $y_{n-1}$ is also given in Eq. (18.92).

The nonlinear function, $f$, can be a simple element-wise sigmoid one, as in simple RNNs, or of a
more complex nature, when the LSTM module is involved (see Section 18.13). The nonlinearity, $g$, is
of the softmax type, involving $l$ possible values. That is, $\hat{y}_n$ is an $l$-dimensional vector that contains
the predicted probabilities of selecting each one of the words in the available dictionary. Eventually,
selection of the words is performed by making use of these predicted probabilities and the procedure
is explained later on in the inference algorithm part. To make sure that there is no confusion with
the notation, $\hat{y}_n$ is not the one-hot vector associated with the predicted word. It is the output vector
of probabilities at the $n$th stage of the RNN decoder, in line with the notation reserved for the RNN
outputs in Section 18.13.
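A single decoder stage, i.e., the state update followed by the softmax output, can be sketched as follows. The code is illustrative only: it assumes a tanh nonlinearity for $f$, feeds $c$ to the stage as in Fig. 18.52, and uses random, untrained matrices `A`, `B`, `C`, `V`, whose names and sizes are hypothetical.

```python
import math
import random


def matvec(M, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_step(y_prev, h_prev, c, A, B, C, V):
    """One stage: h'_n = tanh(A y_{n-1} + B h'_{n-1} + C c), y_hat_n = softmax(V h'_n)."""
    pre = [a + b + d for a, b, d in
           zip(matvec(A, y_prev), matvec(B, h_prev), matvec(C, c))]
    h = [math.tanh(p) for p in pre]
    return h, softmax(matvec(V, h))


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(1)
l, k = 5, 3  # hypothetical output vocabulary size and state dimension
A, B, C, V = rand_matrix(k, l), rand_matrix(k, k), rand_matrix(k, k), rand_matrix(l, k)

c = [0.2, -0.1, 0.4]  # stand-in for the encoder summary vector
sos = [0] * l         # <SOS> as a vector of zeros
h0 = [0.0] * k        # zero initial decoder state
h1, y_hat1 = decoder_step(sos, h0, c, A, B, C, V)
print(y_hat1)
```

The output `y_hat1` is a vector of `l` probabilities that sum to one; it is not a one-hot vector, in agreement with the remark on notation above.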

Thus, we can write

$$P(y_n | y_{n-1}, \ldots, y_1, x_{1:N}) \approx g(h'_n), \qquad (18.93)$$

where we used the symbol $x_{1:N}$ in place of the sequence $x_1, \ldots, x_N$ for notational convenience. For
the sake of clarity and avoiding possible confusion, note that the previous approximation in Eq. (18.93)
notationally means that on the right-hand side (vector), only the element corresponding to the index of
the word $y_n$ is used. Recall that the conditional probability that is of interest to us is the joint conditional
distribution in Eq. (18.91), which, after employing the product rule of probabilities (Section 2.2.2),
is written as

$$P(y_1, \ldots, y_{N'} | x_{1:N}) = P(y_1 | x_{1:N}) \prod_{n=2}^{N'} P(y_n | y_{n-1}, \ldots, y_1, x_{1:N}). \qquad (18.94)$$

Note that one can insert for each factor in Eq. (18.94) the corresponding terms as Eq. (18.93) suggests.

The optimizing criterion: The goal is to maximize the conditional joint probability in Eq. (18.91).
Considering the corresponding log-likelihood and taking into account Eqs. (18.93) and (18.94), this is
equivalent to minimizing the respective cross-entropy over the whole length, $N'$, of the output sequence,
i.e.,

$$J(\theta) = -\sum_{n=1}^{N'} \sum_{i=1}^{l} y_{ni} \ln \hat{y}_{ni}, \qquad (18.95)$$

where $\theta$ includes all the unknown parameters that are involved in both the encoder and the decoder
part. Indeed, minimizing Eq. (18.95), one maximizes the likelihood at the target sequence, since
$y_{ni} = 0$ for any word with index other than the target output one and the probabilities are estimated via
the softmax nonlinearity.
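The connection between the cross-entropy in Eq. (18.95) and the product of probabilities in Eq. (18.94) can be checked numerically on a toy example. In the sketch below, the target one-hot vectors and the predicted probability vectors are made-up numbers and the function name is hypothetical.

```python
import math


def sequence_cross_entropy(targets, predictions):
    """J = -sum_n sum_i y_ni * ln(y_hat_ni) for one-hot target vectors y_n."""
    return -sum(
        y * math.log(p)
        for y_vec, p_vec in zip(targets, predictions)
        for y, p in zip(y_vec, p_vec)
        if y  # only the target index contributes, since y_ni = 0 elsewhere
    )


# Two-word target sequence over a three-word vocabulary (toy values).
targets = [[1, 0, 0], [0, 0, 1]]
predictions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

J = sequence_cross_entropy(targets, predictions)
joint = math.exp(-J)  # product of the probabilities assigned to the target words
print(J, joint)
```

Here `joint` equals $0.7 \times 0.6 = 0.42$ up to rounding, so lowering `J` is the same as raising the probability that the model assigns to the whole target sequence.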

In practice, the optimization takes place over all available phrases in the corpus and the cost can be
written as

$$J(\theta) = -\sum_{(Y_{N'}, X_N) \in V} \sum_{n=1}^{N'} \sum_{i=1}^{l} y_{ni} \ln \hat{y}_{ni},$$

where $V$ denotes the corpus and $Y_{N'}, X_N$ all the possible input-output sequences.

Inference: Having learned the model, given an input phrase one has to decide on the output one. In
theory, one should try in the decoder all possible combinations of words and sequence lengths and come
up with the one that results in the highest joint conditional probability. Undoubtedly, this is impossible
for vocabularies of the size that have been stated. In practice, suboptimal searching techniques are
used. At the other extreme to the previously mentioned exhaustive search lie the greedy algorithms
that decide on the optimal choice in a stage-by-stage fashion, starting at $n = 1$. Greedy algorithms
have been used and commented on in the context of AdaBoost (Section 7.10) and in the context of
sparsity promoting optimization algorithms (Section 10.2.1).

In a greedy algorithm setting, the optimization becomes computationally feasible, since each time
optimization takes place with respect to a single word. For example, assume that at stage $n-1$ a decision
has been reached. Then, the one-hot vector of this word, obtained, e.g., via hard thresholding of the
respective vector of probabilities at stage $n-1$ (in the case, of course, that one-hot representation is
employed), is used in the place of $y_{n-1}$ in the decoder. By searching over all words in the dictionary,
the winner at stage $n$ is the one that scores the highest probability. The procedure is repeated until the
final word has been predicted; for example, an ⟨EOS⟩ token is detected. The major drawback of such suboptimal
algorithms is that they do not consider word combinations, while optimizing stage-by-stage.
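The greedy procedure can be sketched as a short loop. To keep the example self-contained, the trained decoder is replaced by a hypothetical lookup table `TABLE`, in which the probabilities of the next word depend only on the previous word; in an actual decoder they would also depend on the state vector and on $c$. All names and numbers are assumptions made for illustration.

```python
EOS = 0
VOCAB = ["<EOS>", "i", "am", "a", "student"]

# TABLE[prev] holds toy probabilities of the next word given the previous one;
# the key None plays the role of <SOS>.
TABLE = {
    None: [0.0, 0.9, 0.05, 0.05, 0.0],
    1: [0.0, 0.0, 0.8, 0.1, 0.1],
    2: [0.1, 0.0, 0.0, 0.6, 0.3],
    3: [0.0, 0.0, 0.0, 0.0, 1.0],
    4: [0.9, 0.0, 0.0, 0.1, 0.0],
}


def toy_step(prev_id):
    """Stand-in for one decoder stage: return the vector of word probabilities."""
    return TABLE[prev_id]


def greedy_decode(step, max_len=10):
    """At each stage keep the single most probable word; stop at <EOS>."""
    prev, output = None, []
    for _ in range(max_len):
        probs = step(prev)
        prev = max(range(len(probs)), key=probs.__getitem__)  # hard decision
        if prev == EOS:
            break
        output.append(prev)
    return output


words = [VOCAB[i] for i in greedy_decode(toy_step)]
print(" ".join(words))
```

The `max_len` bound is a practical safeguard in case no ⟨EOS⟩ token is ever selected.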

Extensions of the previous basic greedy algorithm are usually used, as for example the beam search
algorithm (e.g., [224]). According to this algorithm, more than one (conditional) probability value,
e.g., the highest $K$, with $K = 5$ being a typical value, is selected at each stage and propagated through
time. Although such an algorithm is still suboptimal, in this way, the final decision is based on information
that relies on successive symbol combinations and not on each stage individually.
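A minimal beam search can be sketched along the same lines. As before, the decoder is replaced by a hypothetical table of next-word probabilities, chosen so that the greedy choice (which corresponds to $K = 1$) is not the best one. The code is a simplified sketch, not the algorithm of any specific reference; in particular, it compares hypotheses of different lengths by their raw log-probabilities, whereas practical systems often apply some form of length normalization.

```python
import math

EOS = 0
# Toy next-word probabilities over the vocabulary {<EOS>, a, b}; None stands for <SOS>.
TABLE = {
    None: [0.0, 0.6, 0.4],
    1: [0.4, 0.3, 0.3],
    2: [0.9, 0.05, 0.05],
}


def toy_step(prev_id):
    """Stand-in for one decoder stage: return the vector of word probabilities."""
    return TABLE[prev_id]


def beam_search(step, beam_width=2, max_len=5):
    """Keep the beam_width most probable partial sequences at each stage."""
    beams = [(0.0, [])]  # (log-probability, word ids so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            prev = seq[-1] if seq else None
            for word, p in enumerate(step(prev)):
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [word]))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            # Sequences ending in <EOS> are complete; the rest are extended further.
            (finished if seq[-1] == EOS else beams).append((logp, seq))
        if not beams:
            break
    return max(finished + beams, key=lambda cand: cand[0])


best_logp, best_seq = beam_search(toy_step, beam_width=2)
greedy_logp, greedy_seq = beam_search(toy_step, beam_width=1)
print(best_seq, math.exp(best_logp))
print(greedy_seq, math.exp(greedy_logp))
```

With a beam of width 2, the sequence starting with the less probable first word wins (joint probability 0.36), while the greedy search commits to the more probable first word and ends up with a joint probability of 0.24.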


RNN with attention mechanism: The rationale underlying the attention mechanism has been briefly
introduced in Section 18.13.2. An interesting variant of the previously discussed RNN encoder-decoder
basic model is to combine it with an attention mechanism technique (e.g., [8]). The corresponding
architecture is illustrated in Fig. 18.53. In this variant, a bidirectional RNN has been employed for the
encoder (see Remarks 18.6). There are two state vectors that are propagated, i.e., one in the forward,
$\overrightarrow{h}_n$, and one in the backward direction, $\overleftarrow{h}_n$.

In this case, the context vector becomes time dependent, i.e., $c_n$, $n = 1, 2, \ldots, N'$, and it is defined
as

$$c_n = \sum_{t=1}^{N} \alpha_{nt} h_t,$$

where the weighting coefficients are given by

$$\alpha_{nt} = \frac{\exp(e_{nt})}{\sum_{k=1}^{N} \exp(e_{nk})}, \quad \text{with } e_{nt} = \phi(h_t, h'_{n-1}),$$

where $\phi$ is a nonlinearity that is implemented via a neural network, which is “learned” during training,
and $h_t$ is formed as the concatenation of the respective forward ($\overrightarrow{h}_t$) and backward ($\overleftarrow{h}_t$) state vectors.
The variable $e_{nt}$ is known as the alignment model and quantifies how well the state vectors of the
encoder, at time $t$, and the decoder, at time $n$, match. Related simulation results have been presented
and commented on following Fig. 18.36.
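The computation of the attention weights and of the context vector can be sketched as follows. The sketch makes one notable simplification, stated here as an assumption: the alignment function $\phi$, which in the text is a trained neural network, is replaced by a plain inner product (`dot_score`). The encoder states and the decoder state are made-up numbers.

```python
import math


def attention_context(enc_states, dec_state, score):
    """Return the weights alpha_nt and the context c_n = sum_t alpha_nt * h_t."""
    e = [score(h_t, dec_state) for h_t in enc_states]  # alignment scores e_nt
    top = max(e)
    exps = [math.exp(v - top) for v in e]              # stable softmax over t
    total = sum(exps)
    alpha = [v / total for v in exps]
    dim = len(enc_states[0])
    context = [sum(a * h[i] for a, h in zip(alpha, enc_states)) for i in range(dim)]
    return alpha, context


def dot_score(h_t, s):
    """Stand-in for the learned alignment network: a simple inner product."""
    return sum(a * b for a, b in zip(h_t, s))


enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy (concatenated) encoder states h_t
dec_state = [2.0, 0.0]                             # toy previous decoder state h'_{n-1}

alpha, context = attention_context(enc_states, dec_state, dot_score)
print(alpha, context)
```

Encoder states that align well with the decoder state (here the first and the third) receive the larger weights, and the weights sum to one by construction.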

Remarks 18.10. 


• A similar modeling rationale has been used in the so-called language modeling. In this context, 
an RNN is trained as a predictor of the next word, given the previous one(s). The RNN learns 
the respective conditional probabilities. Once the model is trained, one can draw samples from the 
learned distributions and generate phrases starting from an initial word or a few words. 

• The rationale that was previously presented has also been used for image annotations. For example, 
one could feed the decoder with the vectorized output of a convolutional network that has been 
trained with images. In other words, the summary vector, c, is replaced by the output of the CNN; 
the latter, instead of being passed to a fully connected network for classification, as Fig. 18.29 
suggests, is used to excite the state $h'_0$ of the decoder. The decoder is then trained to annotate the
images accordingly (see, e.g., [153]). 

• Besides RNNs, other paths are also used for NMT. For example, CNNs have been employed in
[55]. In [3,37,235] neither RNNs nor CNNs are used and the neural model is solely based on attention
mechanism concepts, and it is claimed that much longer word dependencies can be exploited


FIGURE 18.53 

A bidirectional RNN is used as an encoder with forward ($\overrightarrow{h}_n$) and backward ($\overleftarrow{h}_n$) state vector propagation. When
an attention mechanism is used, at each stage, the decoder is fed with a different context vector, $c_n$. The latter is a
linear combination, over all time instants/stages of the encoder, of the combined state vectors (concatenation of $\overrightarrow{h}_n$
and $\overleftarrow{h}_n$). The coefficients $\alpha_{ni}$, $i = 1, \ldots, N$, are learned during training. The figure focuses on the $n$th stage of the
decoder to unclutter the illustration and assumes $N'$ stages in total.


compared to LSTMs. In [43] and [254] further enhancements are reported via the use of pretraining
techniques. At the time this edition is compiled, NMT is still a field of ongoing intense research.


18.19 PROBLEMS 

18.1 Prove that the perceptron algorithm, in its pattern-by-pattern mode of operation, converges in a
finite number of iteration steps. Assume that $\theta^{(0)} = 0$.

Hint: Note that because classes are assumed to be linearly separable, there is a normalized
hyperplane, $\theta_*$, and $\gamma > 0$, so that

$$\gamma \le y_n \theta_*^T x_n, \quad n = 1, 2, \ldots, N,$$

where $y_n$ is the respective label, being $+1$ for $\omega_1$ and $-1$ for $\omega_2$. By the term normalized
hyperplane, we mean that

$$\theta_* = [\hat{\theta}_*^T, \theta_{0*}]^T, \quad \text{with } \|\hat{\theta}_*\| = 1.$$

In this case, $y_n \theta_*^T x_n$ is the distance of $x_n$ from the hyperplane $\theta_*$ [167].

18.2 The derivative of the sigmoid function has been computed in Problem 7.6. Compute the derivative
of the hyperbolic tangent activation function (Eq. (18.8)), and show that it is equal to

$$f'(z) = ac\left(1 - f^2(z)\right).$$

18.3 Show that the effect of the momentum term in the gradient descent backpropagation scheme is
to effectively increase the learning convergence rate of the algorithm.

Hint: Assume that the gradient is approximately constant over I successive iterations. 

18.4 Show that if (a) the activation function is the hyperbolic tangent and (b) the input variables are
normalized to zero mean and unit variance, then to guarantee that all the outputs of the neurons
have zero mean and unit variance, the weights must be drawn from a distribution of zero mean
and standard deviation equal to

$$\sigma = m^{-1/2},$$

where $m$ is the number of synaptic weights associated with the corresponding neuron.

Hint: For simplicity, consider the bias to be zero, and assume that the inputs to each neuron are 
mutually uncorrelated. 

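18.5 Consider the sum of the squared errors cost function

$$J = \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{k_L} (\hat{y}_{nm} - y_{nm})^2. \qquad (18.96)$$

Compute the elements of the Hessian matrix

$$\frac{\partial^2 J}{\partial \theta^r_{kj} \, \partial \theta^{r'}_{k'j'}}. \qquad (18.97)$$

Near the optimum, show that the second-order derivatives can be approximated by

$$\frac{\partial^2 J}{\partial \theta^r_{kj} \, \partial \theta^{r'}_{k'j'}} \approx \sum_{n=1}^{N} \sum_{m=1}^{k_L} \frac{\partial \hat{y}_{nm}}{\partial \theta^r_{kj}} \frac{\partial \hat{y}_{nm}}{\partial \theta^{r'}_{k'j'}}. \qquad (18.98)$$

In other words, the second-order derivatives can be approximated as products of the first-order
derivatives. The derivatives can be computed by following similar arguments as the gradient
descent backpropagation scheme [74].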

18.6 It is common when computing the Hessian matrix to assume that it is diagonal. Show that under
this assumption, the quantities

$$\frac{\partial^2 E}{\partial (\theta^r_{kj})^2}, \quad \text{where } E = \sum_{m=1}^{k_L} \left( f(z^L_m) - y_m \right)^2,$$

propagate backward according to the following:
18.7 Show that if the activation function is the logistic sigmoid and the loss function in (18.44) is
used, then $\delta^L_{nj}$ in (18.21) becomes

$$\delta^L_{nj} = a(\hat{y}_{nj} - y_{nj}).$$

If the cross-entropy in Eq. (18.41) is used, then it is equal to

$$\delta^L_{nj} = a y_{nj}(\hat{y}_{nj} - 1).$$


18.8 Show that if the activation function in the output layer nodes is the logistic sigmoid and the
relative entropy cost function is used, then $\delta^L_{nj}$ in Eq. (18.21) becomes

$$\delta^L_{nj} = a y_{nj}(\hat{y}_{nj} - 1).$$


18.9 Show that the cross-entropy loss function depends on the relative output errors. 

18.10 Show that if the activation function is the softmax and the loss function is the cross-entropy (or
the loss in (18.44)), then $\delta^L_{nj}$ in (18.21) does not depend on the derivatives of the nonlinearity.

18.11 As in the previous problem, use the relative entropy as the cost function and the softmax activation
function. Then show that



18.12 Derive the backpropagation through time algorithm for training RNNs. 

18.13 Derive the gradient of the log-likelihood in Eq. (18.74). 

18.14 Prove that for the case of RBMs, the conditional probabilities are given by the following factorized
form:



18.15 Derive Eq. (18.83). 

COMPUTER EXERCISES 

18.16 Consider a two-dimensional class problem that involves two classes $\omega_1$ ($+1$) and $\omega_2$ ($-1$). Each
one of them is modeled by a mixture of equiprobable Gaussian distributions. Specifically, the
means of the Gaussians associated with $\omega_1$ are $[-5, 5]^T$ and $[5, -5]^T$, while the means of
the Gaussians associated with $\omega_2$ are $[-5, -5]^T$, $[0, 0]^T$, and $[5, 5]^T$. The covariances of all
Gaussians are $\sigma^2 I$, where $\sigma^2 = 1$.

(a) Generate and plot a data set $X_1$ (training set) containing 100 points from $\omega_1$ (50 points
from each associated Gaussian) and 150 points from $\omega_2$ (again 50 points from each associated
Gaussian). In the same way, generate an additional set $X_2$ (test set).

(b) Based on $X_1$, train a two-layer neural network with two nodes in the hidden layer, each one
having the hyperbolic tangent as activation function and a single output node with linear
activation function,$^{10}$ using the standard backpropagation algorithm for 9000 iterations
and step size equal to 0.01. Compute the training and test errors, based on $X_1$ and $X_2$,
respectively. Also, plot the test points as well as the decision lines formed by the network.
Finally, plot the training error versus the number of iterations.

(c) Repeat step 18.16b for step size equal to 0.0001 and comment on the results.

(d) Repeat step 18.16b for $k = 1, 4, 20$ hidden layer nodes and comment on the results.

Hint: Use different seeds in the rand MATLAB® function for the train and the test sets. To
train the neural networks, use the newff MATLAB® function. To plot the decision region
formed by a neural network, first determine the boundaries of the region where the data live
(for each dimension determine the minimum and the maximum values of the data points), then
apply a rectangular grid on this region, and for each point in the grid compute the output of the
network. Then draw this point with different colors according to the class it is assigned to (use,
e.g., the “magenta” and “cyan” colors).

18.17 Consider the classification problem of the previous exercise, as well as the same data sets $X_1$
and $X_2$.

Consider a two-layer feed-forward neural network as the one in (b) and train it using the adaptive
backpropagation algorithm with initial step size equal to 0.0001 and $r_i = 1.05$, $r_d = 0.7$,
$c = 1.04$, for 6000 iterations. Compute the training and test errors, based on $X_1$ and $X_2$, respectively,
and plot the error during training against the number of iterations. Compare the results
with those obtained from the previous exercise (b).

18.18 Repeat the previous exercise for the case where the covariance matrix for the Gaussians is $6I$,
for 2, 20, and 50 hidden layer nodes, compute the training and the test errors in each case, and 
draw the corresponding decision regions. Draw your conclusions. 

18.19 This is a classification-related exercise to gain experience by playing with the dropout regularization
in the context of the MNIST$^{11}$ database.

Train a feed-forward neural network using the TensorFlow$^{12}$ machine learning framework. You
are encouraged to use the Keras$^{13}$ high-level API that is already implemented in TensorFlow
for simplicity.

(a) Load the data set where the 55000 images are used for training and the 10000 for testing.
Normalize image values so that all values are in the [0, 1] interval. In Keras, you may use
the functions provided by the tensorflow.keras.datasets.mnist module.

(b) Create a feed-forward neural network that consists of an input layer of 784 neurons (number
of pixels of the $28 \times 28$ input images), two hidden layers of 2000 neurons each, and


10 The number of input nodes is equal to the dimensionality of the feature space.

11 http://yann.lecun.com/exdb/mnist/.

12 https://www.tensorflow.org.

13 https://www.tensorflow.org/guide/keras.
an output layer of 10 neurons, where 10 is the number of all possible digit classes. Use
the ReLU activation function for all layers except for the output layer, where softmax
activation is used. Train this network for 600 epochs using the gradient descent optimizer
with constant learning rate 0.01. Employ the cross-entropy cost function. Use a minibatch
size equal to 100 or any other value that fits your RAM. In Keras, you may define and
train your network using the Sequential module.

(c) Repeat step (b) using 50% dropout in the hidden layers. 

(d) Repeat step (b) using 50% dropout in the hidden layers and 20% dropout in the input 
layer. 

(e) For the models in (b)-(d), plot the number of classification errors in the testing set for 
each epoch. Comment on the results. 

18.20 The goal of this exercise is to gain experience in using different optimizers and sizes of convolution
kernels. Consider the CIFAR-10 common benchmark classification problem using
TensorFlow.$^{14}$ The task is to classify $32 \times 32$ pixel RGB images in 10 categories: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

(a) Try to reproduce the results for the model described in the tutorial, which is available at
the link given in the footnote.

(b) Play with the hyperparameters of the model. For example, as a set of parameters (optimizer,
initial learning rate, convolutional kernel size), use some of the following combinations:

- optimizer: Adam and stochastic gradient descent.

- initial learning rate: 0.1, 0.01, 0.001, 0.0001, 0.00001.

- convolutional kernel size: 3, 5, or 7.

(c) Comment on the results using the TensorBoard$^{15}$ visualization tool provided by TensorFlow.

Hint: You may use and edit the source code provided by the tutorial. Alternatively, you may
use the Keras high-level API in order to define, train, and evaluate your model.

18.21 This exercise focuses on the original GAN, which is trained on the MNIST database to “learn”
to generate fake hand-written characters.

To implement and train the network, use the TensorFlow framework. Moreover, as in the previous
exercise, you are encouraged to use the Keras high-level API for simplicity.

(a) Load the data set of 60000 images to be used for training. Normalize image values so that all
values are in the $[-1, 1]$ interval. In Keras, you may use the functions provided by the
tensorflow.keras.datasets.mnist module.

(b) The generator is fed at its input with a noise vector and outputs an image. The dimension
of the input noise vector is 100, and its elements are i.i.d. sampled from a normal distribution
of zero mean and unit variance. The generator consists of three fully connected
layers with 256, 512, and 1024 neurons, respectively. The activation function used for the
neurons in the hidden layers is the leaky ReLU with parameter $\alpha = 0.2$. The output layer
comprises 784 nodes and the tanh is employed as the respective activation function.


14 https://www.tensorflow.org/tutorials/images/deep_cnn.

15 https://www.tensorflow.org/guide/summaries_and_tensorboard.
(c) The discriminator takes as input a $1 \times 784$ vector, corresponding to the vectorized form
of the $28 \times 28$ MNIST images. The discriminator consists of three fully connected layers
with 1024, 512, and 256 neurons, respectively. The leaky ReLU activation function is
also employed, with $\alpha = 0.2$. During training, use the dropout regularization method,
with probability of discarding nodes equal to 0.3. The output layer consists of a single
node with a sigmoid activation function.

(d) To train the implemented network, use the two-class (binary) cross-entropy loss function.
Adopt the Adam minimizer with step size (learning rate) equal to $2 \times 10^{-3}$, $\beta_1 = 0.5$, and
$\beta_2 = 0.999$ as parameters for the optimizer. The recommended batch size is 100. Train
the network for 400 epochs as follows. For each training loop, (a) generate a random set
of input noise and images, (b) generate fake images via the generator, (c) train only the
discriminator, and (d) then train only the generator, according to Algorithm 18.5. Play
with the number of iterations associated with the discriminator training.

(e) During training and every 20 epochs, visualize the generated images created by the generator
and comment on the evolution of the learning process.

(f) Play with all the various parameters that have been suggested above and see the effect on
the training.

18.22 This exercise focuses on the NMT task. For the purposes of this exercise, use the NMT tutorial
provided by TensorFlow.$^{16}$

(a) Install the tutorial following the instructions available at the provided link. Subsequently, train
the default configuration of the model, which uses LSTMs as both the encoder and the
decoder of the model with 128-dimensional hidden states and embeddings, and performs
regularization via dropout. Embeddings are obtained through a simple matrix multiplication
of the one-hot vectors with a trainable weight matrix. The available data set pertains
to a standard German-to-English translation benchmark.

(b) Run the inference algorithm, so as to obtain translations for the available test set. Select the
greedy search algorithm, which simply picks at each stage the word corresponding to the
highest predictive probability, paying no attention to the underlying temporal dynamics.

(c) How does the performance change if you repeat the previous experiment using beam
search with a beam width of $K = 5$? What factor do you think contributes to the observed
inferential improvement?

(d) How does the performance change if you increase the beam size to $K = 10$? What is your
take-home lesson?

(e) Increase the latent representation space dimensionality. Let us postulate hidden layers (as
well as embedding layers) of two and three times the initially considered one. How do
training and decoding time change? What about the obtained accuracy?

(f) The model comprises a lot of parameters. Therefore, one would expect it to overfit if no
countermeasure is adopted. Here, the model uses simple dropout. What happens if you
remove this regularization layer across the network?

16 https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh.
19.1 INTRODUCTION

In many practical applications, although the data reside in a high-dimensional space, the true dimensionality,
known as intrinsic dimensionality, can be of a much lower value. We have met such cases in
the context of sparse modeling in Chapter 9. There, although the data lay in a high-dimensional space,
a number of the components were known to be zero. The task was to learn the locations of the zeros;
this is equivalent to learning the specific subspace, which is determined by the locations of the nonzero
components. In this chapter, the goal is to treat the task in a more general setting and assume that the
data can lie in any possible subspace (not only the ones formed by the removal of coordinate axes) or
manifold. For example, in a three-dimensional space, the data may cluster around a straight line, or
around the circumference of a circle or the graph of a parabola, arbitrarily placed in $\mathbb{R}^3$. In all previous
cases, the intrinsic dimensionality of the data is equal to one, as any of these curves can equivalently
be described in terms of a single parameter. Fig. 19.1 illustrates the three cases. Learning the lower-dimensional
structure associated with a given set of data is gaining in importance in the context of big
data processing and analysis. Some typical examples are the disciplines of computer vision, robotics,
medical imaging, and computational neuroscience.

The goal of this chapter is to introduce the reader to the main directions, which are followed in this
topic, starting from more classical techniques, such as principal component analysis (PCA) and factor
analysis, both in their standard as well as in their probabilistic formulations. Canonical correlation
analysis (CCA), independent component analysis (ICA), nonnegative matrix factorization (NMF), and





FIGURE 19.1 

The data reside close to (A) a straight line, (B) the circumference of a circle, and (C) the graph of a parabola in the 
three-dimensional space. In all three cases, the intrinsic dimensionality of the data is equal to one. In (A) the data 
are clustered around a (translated/affine) linear subspace and in (B) and (C) around one-dimensional manifolds. 
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dictionary learning techniques are also discussed; in the latter case, data are represented via an expan- 
sion in terms of overcomplete dictionaries, and sparsity-related arguments are mobilized to detect the 
most relevant atoms in the dictionary. Finally, nonlinear techniques for learning (nonlinear) manifolds 
are presented such as the kernel PCA, the local linear embedding (LLE), and the isometric mapping 
(ISOMAP) techniques. At the end of the chapter, a case study in the context of fMRI data analysis is 
presented. 


19.2 INTRINSIC DIMENSIONALITY 

A data set $\mathcal{X} \subset \mathbb{R}^l$ is said to have intrinsic dimensionality $m \le l$ if $\mathcal{X}$ can be (approximately) described 
in terms of m free parameters. Take as an example the case where the vectors in $\mathcal{X}$ are generated as 
functions in terms of m random variables, that is, $x = g(u_1, \ldots, u_m)$, $u_i \in \mathbb{R}$, $i = 1, \ldots, m$. The 
corresponding geometric interpretation is that the respective observation vectors will lie along a manifold, 
whose form depends on the vector-valued function $g : \mathbb{R}^m \longmapsto \mathbb{R}^l$. Let us consider the case where 

$$x = [r\cos\theta,\; r\sin\theta]^T,$$

where r is a constant and the random variable $\theta \in [0, 2\pi]$. The data lie along the circumference of a 
circle of radius r and a single free parameter suffices to describe the data. If now a small amount of noise 
is added, then the data will be clustered close to the circumference, as for example in Fig. 19.1B, and the 
intrinsic dimensionality is equal to one. From a statistical point of view, it means that the components 
of the random vectors are highly correlated. Sometimes, we say that the “effective” dimensionality is 
lower than the apparent one of the “ambient” space, in which the lower-dimensional manifold lies. 
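As a small illustration of the circle example, the following sketch (not part of the original text) draws points from the single free parameter $\theta$ and adds a small amount of noise. The function name, the noise level, and the number of points are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_circle(n_points, radius=1.0, noise_std=0.05):
    """Draw x = [r cos(theta), r sin(theta)]^T plus small Gaussian noise.

    Returns a 2 x n_points array (one observation per column) and the
    values of the single free parameter theta that generated them.
    """
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)  # the one free parameter
    clean = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=0)
    noisy = clean + noise_std * rng.standard_normal(clean.shape)
    return noisy, theta

X, theta = sample_noisy_circle(500)

# Although the ambient space is two-dimensional, every point stays close
# to the circle of radius 1, which is traced out by theta alone.
radii = np.linalg.norm(X, axis=0)
```

The observations occupy a two-dimensional ambient space, yet their distance from the origin barely varies, which is the sense in which one parameter (approximately) describes the data.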

In a more general setting, the data may lie in groups of manifolds or even in groups of clusters or 
they may follow a special spatial or temporal structure. For example, in the wavelet domain most of 
the coefficients of an image are close to zero and can be neglected, yet the larger (nonzero) ones have 
a particular structure that is characteristic of natural images. Such a structured sparsity has been ex- 
ploited in the JPEG2000 coding scheme. Structured sparsity representations are often met in many big 
data applications (see, for example, [42]). In this chapter, we will only focus on identifying manifold 
structures, linear (subspaces/affine subspaces) in the beginning and nonlinear ones later on. 

Learning the manifold in which a data set resides can be used to provide a compact low-dimensional 
encoding of a high-dimensional data set, which can subsequently be exploited for performing processing 
and learning tasks in a much more efficient way. Also, dimensionality reduction can be used for 
data visualization. 


19.3 PRINCIPAL COMPONENT ANALYSIS 

Principal component analysis (PCA) or Karhunen-Loeve transform is among the oldest and most 
widely used methods for dimensionality reduction [105]. The assumption underlying PCA, as well as 
any dimensionality reduction technique, is that the observed data are generated by a system or process 
that is driven by a (relatively) small number of latent (not directly observed) variables. The goal is to 
learn this latent structure. 

Given a set of observation vectors, $x_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, of a random vector x, which will 
be assumed to be of zero mean (otherwise the mean/sample mean is subtracted), PCA determines a 
subspace of dimension $m \le l$, such that after projection on this subspace, the statistical variation of the 
data is optimally retained. This subspace is defined in terms of m mutually orthogonal axes, known 
as principal axes or principal directions, which are computed so that the variance of the data, after 
projection on the subspace, is maximized [95]. 

We will derive the principal axes in a step-wise fashion. First, assume that $m = 1$ and the goal is to 
find a single direction in $\mathbb{R}^l$ so that the variance of the corresponding projections of the data points is 
maximized. Let $u_1$ denote the principal axis. The variance of the projections (having assumed centered 
data) is given by 

$$J(u_1) = \frac{1}{N}\sum_{n=1}^{N} (u_1^T x_n)^2 = \frac{1}{N}\sum_{n=1}^{N} (u_1^T x_n)(x_n^T u_1) = u_1^T \hat{\Sigma} u_1,$$

where 

$$\hat{\Sigma} := \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T \tag{19.1}$$

is the sample covariance matrix of the data. For large values of N or if the statistics can be computed, 
the covariance (instead of the sample covariance) matrix can be used. The task now becomes that of 
maximizing the variance. However, because we are only interested in directions, the principal axis will 
be represented by the respective unit norm vector. Thus, the optimization task is cast as 

$$u_1 = \arg\max_{u}\; u^T \hat{\Sigma} u, \tag{19.2}$$

$$\text{s.t.}\quad u^T u = 1. \tag{19.3}$$

This is a constrained optimization problem and the corresponding Lagrangian is given by 

$$L(u, \lambda) = u^T \hat{\Sigma} u - \lambda (u^T u - 1). \tag{19.4}$$

Taking the gradient and setting it equal to zero we get 

$$\hat{\Sigma} u = \lambda u. \tag{19.5}$$

In other words, the principal direction is an eigenvector of the sample covariance matrix. Plugging 
Eq. (19.5) into Eq. (19.2) and taking into account (19.3), we obtain 

$$u^T \hat{\Sigma} u = \lambda. \tag{19.6}$$

Hence, the variance is maximized if $u_1$ is the eigenvector that corresponds to the maximum eigenvalue, 
$\lambda_1$. Recall that because the (sample) covariance matrix is symmetric and positive semidefinite, all the 
eigenvalues are real and nonnegative. Assuming $\hat{\Sigma}$ to be invertible (hence, necessarily, $N > l$), the 
eigenvalues are all positive, that is, $\lambda_1 > \lambda_2 > \cdots > \lambda_l > 0$, and we also assume they are distinct, in order 
to simplify the discussion. 

The second principal component is selected so that it (a) is orthogonal to $u_1$ and (b) maximizes the 
variance after projecting the data onto this direction. Following similar arguments as before, a similar 
optimization task results with an extra constraint, $u^T u_1 = 0$. It can easily be shown (Problem 19.1) that 
the second principal axis is the eigenvector corresponding to the second-largest eigenvalue, $\lambda_2$. The 
process continues until m principal axes have been obtained; they are the eigenvectors corresponding 
to the m largest eigenvalues. 
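The derivation above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the helper name `pca_eig` and the synthetic data are hypothetical, and the data matrix is taken to have one observation per column.

```python
import numpy as np

def pca_eig(X, m):
    """PCA through the eigendecomposition of the sample covariance matrix.

    X is an l x N array whose columns are the observations. Returns the
    l x m matrix of principal axes, all eigenvalues in descending order,
    and the sample mean that was subtracted.
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                              # center the data
    N = Xc.shape[1]
    Sigma_hat = (Xc @ Xc.T) / N                # sample covariance, Eq. (19.1)
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :m], eigvals, mean

# Synthetic correlated data in R^3 (an assumption made for the example).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 2000))

U_m, eigvals, mean = pca_eig(X, m=2)

# Variance of the projections on the first principal axis; by Eq. (19.6)
# it should coincide with the largest eigenvalue.
proj_var = np.var(U_m[:, 0] @ (X - mean))
```

`numpy.linalg.eigh` is used because the sample covariance matrix is symmetric; it returns real eigenvalues and orthonormal eigenvectors.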

PCA, SVD, AND LOW RANK MATRIX FACTORIZATION 

The SVD decomposition of a matrix was discussed in Section 6.4. Given a matrix $X \in \mathbb{R}^{l \times N}$, we 
can write 

$$X = U D V^T. \tag{19.7}$$

For a rank r matrix X, U is the $l \times r$ matrix having as columns the eigenvectors corresponding 
to the r nonzero eigenvalues of $XX^T$, and V is the $N \times r$ matrix having as columns the respective 
eigenvectors of $X^T X$; D is a square $r \times r$ diagonal matrix comprising the singular values¹ 
$\sigma_i := \sqrt{\lambda_i}$, $i = 1, 2, \ldots, r$. If we construct X to have as columns the data vectors $x_n$, $n = 1, 2, \ldots, N$, 
then $XX^T$ is a scaled version of the corresponding sample covariance matrix, $\hat{\Sigma}$; hence, the respective 
eigenvectors coincide and the corresponding eigenvalues are equal within a scaling factor (N). Without 
harming generality, we can assume $XX^T$ to be full rank ($r = l < N$), and Eq. (19.7) becomes 


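$$X = \underbrace{[u_1, \ldots, u_l]}_{l \times l}\; \underbrace{\begin{bmatrix} \sqrt{\lambda_1}\, v_1^T \\ \vdots \\ \sqrt{\lambda_l}\, v_l^T \end{bmatrix}}_{l \times N} = [u_1, \ldots, u_l] \begin{bmatrix} \sqrt{\lambda_1}\, v_{11} & \cdots & \sqrt{\lambda_1}\, v_{1n} & \cdots & \sqrt{\lambda_1}\, v_{1N} \\ \vdots & & \vdots & & \vdots \\ \sqrt{\lambda_l}\, v_{l1} & \cdots & \sqrt{\lambda_l}\, v_{ln} & \cdots & \sqrt{\lambda_l}\, v_{lN} \end{bmatrix}. \tag{19.8}$$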


Thus, the columns of X can be written in terms of the following expansion²: 

$$x_n = \sum_{i=1}^{l} z_{ni} u_i = \sum_{i=1}^{m} z_{ni} u_i + \sum_{i=m+1}^{l} z_{ni} u_i, \tag{19.9}$$

where $z_n := [z_{n1}, \ldots, z_{nl}]^T$ is the nth column of the $l \times N$ matrix on the right-hand side in Eq. (19.8). 
That is, by definition, $z_{n1} = \sqrt{\lambda_1}\, v_{1n}$ and so on. The sum in Eq. (19.9) has been split into two terms, 
where m can be any value $1 \le m \le l$. Note that, due to the orthonormality of the $u_i$s, 

$$z_{ni} = u_i^T x_n, \quad i = 1, 2, \ldots, l, \quad n = 1, 2, \ldots, N.$$

1 Because in some places we are going to involve the variance $\sigma^2$, we will carry on working with the square root of the 
eigenvalues, to avoid possible confusion. 

2 Note that what we have defined in previous chapters as the data matrix is the transpose of X. This is because, for dimensionality 
reduction tasks, it is more common to work with the current notational convention. If the transpose of X is used, the expansion 
of the data vectors is in terms of the columns of V and the analysis carries on in a similar way. 

From Section 6.4, we know that the best, in the Frobenius sense, m rank matrix approximation of 
X is given by 

$$\hat{X} = \underbrace{[u_1, \ldots, u_m]}_{l \times m}\; \underbrace{\begin{bmatrix} \sqrt{\lambda_1}\, v_1^T \\ \vdots \\ \sqrt{\lambda_m}\, v_m^T \end{bmatrix}}_{m \times N} \tag{19.10}$$

$$= \sum_{i=1}^{m} \sqrt{\lambda_i}\, u_i v_i^T. \tag{19.11}$$

Recalling the previous definition of $z_{ni}$, the nth column vector of $\hat{X}$ can now be written as 

$$\hat{x}_n = \sum_{i=1}^{m} z_{ni} u_i. \tag{19.12}$$

Comparing Eqs. (19.9) and (19.12) and taking into account the orthonormality of $u_i$, $i = 1, 2, \ldots, l$, we 
readily see that $\hat{x}_n$ is the projection of the original observation vectors, $x_n$, $n = 1, 2, \ldots, N$, onto the 
subspace span$\{u_1, \ldots, u_m\}$ generated by the m principal axes of $XX^T$ ($\hat{\Sigma}$) (Fig. 19.2). 

The previous arguments establish a bridge between PCA and SVD. In other words, the principal 
axes can be obtained via the SVD decomposition of X. Moreover, the columns of the best m rank 
matrix approximation, $\hat{X}$, of X are the projections of the observation vectors $x_n$ on the (optimally) 
reduced in dimension subspace, spanned by the principal axes. 
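The bridge between PCA and SVD can be checked numerically. The sketch below is an illustration only; the dimensions and the synthetic data are assumptions. It verifies that the squared singular values of X, divided by N, equal the eigenvalues of the sample covariance matrix, and that the truncated SVD of Eq. (19.10) coincides with the projection of the columns of X onto the span of the first m principal axes.

```python
import numpy as np

rng = np.random.default_rng(1)
l, N, m = 5, 200, 2

# Synthetic centered data matrix with observations as columns.
X = rng.standard_normal((l, l)) @ rng.standard_normal((l, N))
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = U diag(s) Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of the sample covariance matrix, sorted in descending order.
eigvals = np.linalg.eigvalsh(X @ X.T / N)[::-1]

U_m = U[:, :m]
X_hat = U_m @ np.diag(s[:m]) @ Vt[:m, :]   # best rank-m approximation, Eq. (19.10)
X_proj = U_m @ (U_m.T @ X)                 # projection onto span{u_1, ..., u_m}
```

In this sketch `s**2` plays the role of the eigenvalues of $XX^T$, so the factor N only rescales them to those of the sample covariance matrix.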



FIGURE 19.2 

The projection of $x_n$ on the principal axis $u_1$ is given by $\hat{x}_n = z_{n1} u_1$, where $z_{n1} = u_1^T x_n$. 

Looking at Eq. (19.10), PCA can also be seen as a low rank matrix factorization method. Matrix 
factorization will be a recurrent theme in this chapter. Given a matrix X, there is no unique way 
to factorize it in terms of two matrices. PCA provides an m rank matrix factorization of X, by 
imposing orthogonality on the structure of the involved factors. Later on, we are going to discuss other 
approaches. 

Finally, it is important to emphasize that the bridge between PCA and SVD establishes a connection 
between the low rank factorization of a matrix X and the intrinsic dimensionality of the subspace in 
which its column vectors reside, since this is the subspace where maximum variance of the data is 
guaranteed. 


MINIMUM ERROR INTERPRETATION 


Having established the bridge between PCA and SVD, another interpretation of the PCA method 
becomes readily available. Because $\hat{X}$ is the best m rank matrix approximation of X in the Frobenius 
sense, the quantity 

$$\|X - \hat{X}\|_F^2 := \sum_{i=1}^{l}\sum_{j=1}^{N} |X(i,j) - \hat{X}(i,j)|^2 = \sum_{n=1}^{N} \|x_n - \hat{x}_n\|^2$$

is minimum; that is, obtaining any other m-dimensional approximation (say, $\tilde{x}_n$) of $x_n$, by choosing to 
project onto another m-dimensional subspace, would result in a higher squared error norm approximation, 
compared to that resulting from PCA. This is also a strong result that establishes a notable merit 
of the PCA method as a dimensionality reduction technique. This interpretation goes back to Pearson 
[146]. 
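The minimum error property can be illustrated with a short numerical experiment. This is a sketch on assumed synthetic data, not a proof: the error of the PCA projection is compared with the errors obtained by projecting onto randomly drawn m-dimensional subspaces. It also uses the standard fact that the squared Frobenius error of the best rank-m approximation equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
l, N, m = 6, 300, 2

# Synthetic centered data, one observation per column.
X = rng.standard_normal((l, l)) @ rng.standard_normal((l, N))
X = X - X.mean(axis=1, keepdims=True)

U, s, _ = np.linalg.svd(X, full_matrices=False)

def projection_error(X, B):
    """Squared Frobenius error after projecting the columns of X onto
    the subspace spanned by the orthonormal columns of B."""
    return np.linalg.norm(X - B @ (B.T @ X), "fro") ** 2

pca_error = projection_error(X, U[:, :m])

# Errors for 100 random m-dimensional subspaces; the orthonormal bases
# are obtained from the QR factorization of random Gaussian matrices.
random_errors = []
for _ in range(100):
    Q, _ = np.linalg.qr(rng.standard_normal((l, m)))
    random_errors.append(projection_error(X, Q))
```

None of the random subspaces attains a smaller error than the one spanned by the principal axes, in line with the interpretation above.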


PCA AND INFORMATION RETRIEVAL 


The previous minimum error interpretation paves the way to build around PCA an efficient searching 
procedure for identifying similar patterns in large databases. Assume that a number N of prototypes 
are represented in terms of l features, giving rise to feature vectors, $x_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, which 
are stored in a database. Given an unknown object, which is represented by a feature vector x, the 
task is to identify to which one among the prototypes this pattern is most similar. Similarity is 
measured in terms of the (squared) Euclidean distance $\|x - x_n\|^2$. If N and l are large, searching for the minimum 
Euclidean distance can be computationally very expensive. The idea is to keep in the database the 
components $z_n^{(m)} := [z_{n1}, \ldots, z_{nm}]^T$ (see Eq. (19.12)) that describe the projections of the N prototypes 
in span$\{u_1, \ldots, u_m\}$, instead of the original l-dimensional feature vectors. Assuming that m is large 
enough to capture most of the variability of the original data (i.e., the intrinsic dimensionality of the 
data is m to a good approximation), then $z_n^{(m)}$ is a good feature vector description because we know 
that in this case $\hat{x}_n \approx x_n$. Given now an unknown pattern, x, we first project it onto span$\{u_1, \ldots, u_m\}$, 
resulting in 

$$\hat{x} = \sum_{i=1}^{m} (u_i^T x)\, u_i = \sum_{i=1}^{m} z_i u_i. \tag{19.13}$$

Then we have 

$$\|\hat{x}_n - \hat{x}\|^2 = \Big\| \sum_{i=1}^{m} z_{ni} u_i - \sum_{i=1}^{m} z_i u_i \Big\|^2 = \|z_n^{(m)} - z\|^2,$$

where $z := [z_1, \ldots, z_m]^T$. In other words, Euclidean distances are computed in the lower-dimensional 
subspace, which leads to substantial computational gains (see, for example, [22,63,160] and the 
references therein). This method is also known as latent semantics indexing. 
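A minimal sketch of the retrieval procedure is given below. Everything specific in it is an assumption made for illustration: the synthetic prototypes (constructed to lie close to an m-dimensional subspace), the helper name `query`, and the noise levels. The database stores only the m-dimensional score vectors, and a query is matched by Euclidean distance in the reduced space.

```python
import numpy as np

rng = np.random.default_rng(3)
l, N, m = 50, 400, 5

# Hypothetical prototypes lying close to an m-dimensional subspace of R^l.
basis = np.linalg.qr(rng.standard_normal((l, m)))[0]
prototypes = basis @ (3.0 * rng.standard_normal((m, N)))
prototypes = prototypes + 0.01 * rng.standard_normal((l, N))

# Principal axes of the centered prototypes.
mean = prototypes.mean(axis=1, keepdims=True)
Xc = prototypes - mean
U, _, _ = np.linalg.svd(Xc, full_matrices=False)
U_m = U[:, :m]

# The database keeps the m x N score vectors instead of the l x N originals.
Z_db = U_m.T @ Xc

def query(x):
    """Index of the prototype closest to x, searched in the reduced space."""
    z = U_m.T @ (x.reshape(-1, 1) - mean)       # project the query pattern
    d2 = np.sum((Z_db - z) ** 2, axis=0)        # squared distances in R^m
    return int(np.argmin(d2))

# A slightly perturbed copy of prototype 123 serves as the unknown pattern.
target = 123
x_query = prototypes[:, target] + 0.01 * rng.standard_normal(l)
```

Each distance evaluation now costs on the order of m instead of l operations, which is where the computational gain comes from when m is much smaller than l.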


ORTHOGONALIZING PROPERTIES OF PCA AND FEATURE GENERATION 

We will now shed light on PCA from a different angle. We have just discussed, in the context of 
the information retrieval application, that PCA can also be seen as a feature generation method that 
generates a set of new feature vectors, z, whose components describe a pattern in terms of the principal 
axes. Let us now assume (to make life easier) that N is large enough and the sample covariance matrix 
is a good approximation of the (full rank) covariance matrix $\Sigma = \mathbb{E}[xx^T]$. We know that any vector 
$x \in \mathbb{R}^l$ can be described in terms of $u_1, \ldots, u_l$, that is, 

$$x = \sum_{i=1}^{l} z_i u_i = \sum_{i=1}^{l} (u_i^T x)\, u_i.$$

Our focus now turns to the covariance matrix of the random vectors, z, as x changes randomly. Taking 
into account that 

$$z_i = u_i^T x, \tag{19.14}$$

and the definition of U in Eqs. (19.7) and (19.8), we can write $z = U^T x$, and hence 

$$\mathbb{E}[zz^T] = \mathbb{E}[U^T x x^T U] = U^T \Sigma U.$$

However, we know from linear algebra (Appendix A.2) that U is the matrix that diagonalizes $\Sigma$; hence, 

$$\mathbb{E}[zz^T] = \mathrm{diag}\{\lambda_1, \ldots, \lambda_l\}. \tag{19.15}$$

In other words, the new features are uncorrelated, that is, 

$$\mathbb{E}[z_i z_j] = 0, \quad i \ne j, \quad i, j = 1, 2, \ldots, l. \tag{19.16}$$

Furthermore, note that the variances of $z_i$ are equal to the eigenvalues $\lambda_i$, $i = 1, 2, \ldots, l$, 
respectively. Hence, by selecting as features the ones that correspond to the dominant eigenvalues, one 
has maximally retained the total variance associated with the original features, $x_i$; indeed, the 
corresponding total variance is given by the trace of the covariance matrix, which in turn is equal to 
the sum of the eigenvalues, as we know from linear algebra. In other words, the new set of features, 
$z_i$, $i = 1, 2, \ldots, m$, represent the patterns in a more compact way, as they are mutually uncorrelated 
and most of the variance is retained. It is common in practice, when the goal is that of feature 
generation, for each one of the $z_i$s to be normalized to unit variance. 

Later on, we will see that a more recent method, known as ICA, imposes the constraint that after 
a linear transformation (a projection is a linear transformation, after all) the obtained latent variables 
(components) are statistically independent, which is a much stronger condition than being uncorrelated. 
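The orthogonalizing property can be verified on sample quantities. In the sketch below, the synthetic data and variable names are assumptions. The sample covariance of the new features $z = U^T x$ turns out diagonal, its diagonal holds the eigenvalues, and the optional unit-variance normalization mentioned above yields an identity covariance.

```python
import numpy as np

rng = np.random.default_rng(4)
l, N = 4, 5000

# Synthetic centered data with correlated components.
X = rng.standard_normal((l, l)) @ rng.standard_normal((l, N))
X = X - X.mean(axis=1, keepdims=True)

Sigma_hat = X @ X.T / N
eigvals, U = np.linalg.eigh(Sigma_hat)      # ascending order
eigvals, U = eigvals[::-1], U[:, ::-1]      # reorder to descending

Z = U.T @ X                                 # new features, z = U^T x
cov_Z = Z @ Z.T / N                         # sample version of Eq. (19.15)

# Normalizing each feature to unit variance.
Z_white = Z / np.sqrt(eigvals)[:, None]
cov_white = Z_white @ Z_white.T / N
```

The trace of `Sigma_hat` equals the sum of the eigenvalues, so keeping the features with the largest eigenvalues retains the largest possible share of the total variance.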


LATENT VARIABLES 

The random components $z_i$, $i = 1, 2, \ldots, m$, are known as principal components. Sometimes, their 
observed values, $z_i$, are known as principal scores. As a matter of fact, the principal components 
comprise the latent variables, which we mentioned at the beginning of this section. 

According to the general (linear) latent variable modeling approach, we assume that our l variables 
comprising x are modeled as 

$$x \approx A z, \tag{19.17}$$

where A is an $l \times m$ matrix and $z \in \mathbb{R}^m$ is the corresponding set of latent variables. Adopting the PCA 
model, we have shown that 

$$A = [u_1, \ldots, u_m] =: U_m,$$

and the model implies that each one of the l components of x is (approximately) generated in terms of 
these mutually uncorrelated m latent random variables, that is, 

$$x_i \approx u_{i1} z_1 + \ldots + u_{im} z_m, \tag{19.18}$$

where $u_{ij}$ denotes the ith component of $u_j$. 

Alternatively, in linear latent variable modeling, we can assume that the latent variables can also be 
recovered by a linear model from the original random variables, as for example, 

$$z = W x. \tag{19.19}$$

In the case of the PCA approach, we have already seen that 

$$W = U_m^T.$$

Eqs. (19.17) and (19.19) constitute the backbone of this chapter, and different methods provide different 
solutions for computing A or W. 

Let us now collect all the principal score vectors, $z_n$, $n = 1, 2, \ldots, N$, as the columns of the $m \times N$ 
score matrix Z, that is, 

$$Z := [z_1, \ldots, z_N]. \tag{19.20}$$

Then (19.10) can be rewritten in terms of the score matrix 

$$X \approx U_m Z. \tag{19.21}$$

Moreover, taking into account the definition of the principal components in Eq. (19.14), we can also 
write 

$$Z = U_m^T X. \tag{19.22}$$

Remarks 19.1. 

• A major issue in practice is to select the m dominant eigenvalues. One way is to rank them in 
descending order, and determine m so that the gap between $\lambda_m$ and $\lambda_{m+1}$ is “large.” The interested 
reader can obtain more on this issue in [55,104]. 

• The treatment so far involved centered quantities. In case we want to approximate the original 
observation vectors by taking into consideration the respective mean value of the data set, Eq. (19.13) 
is rephrased as 

$$\hat{x} = \bar{x} + \sum_{i=1}^{m} u_i^T (x - \bar{x})\, u_i, \tag{19.23}$$

where $\bar{x}$ is the sample mean (mean if it is known) 

$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n,$$

and x denotes the original (not centered) vector. 

• PCA builds upon global information spread over all the data observations in the set $\mathcal{X}$. Indeed, the 
main source of information is the sample covariance matrix ($XX^T$). Thus, PCA is effective if the 
covariance matrix provides a sufficiently rich description of the data at hand. For example, this is the 
case for Gaussian-like distributions. In [41], modifications of the standard approach are suggested 
in order to deal with data having a clustered nature. Soon, we are going to discuss techniques 
alternative to PCA in order to overcome this drawback. 

• Computing the SVD of large matrices can be computationally costly and a number of efficient 
techniques have been proposed (see, e.g., [1,83,194]). In a number of cases in practice, it turns 
out that $l > N$. Of course, in this case, the sample covariance is not invertible and some of the 
eigenvalues are zero. In such scenarios, it is preferable to work with the $X^T X$ ($N \times N$) instead of the 
$XX^T$ ($l \times l$) matrix. To this end, the relationships given in Section 6.4, in order to obtain $u_i$ from $v_i$, 
can be employed. 

• The treatment of PCA bears a similarity with the Fisher linear discriminant (FLD) method 
(Chapter 7). They both rely on the eigenstructure of matrices that, in one way or another, encode 
(co)variance information. However, note that PCA is an unsupervised method, in contrast to FLD, which 
is a supervised one. As a consequence, PCA performs dimensionality reduction so as to preserve 
data variability (variance), while FLD does so in order to preserve class separability. Fig. 19.3 demonstrates the difference in the 
resulting (hyper)planes. 

• Multidimensional scaling (MDS) is another linear technique used to project in a lower-dimensional 
space, while respecting certain constraints. Given the set $\mathcal{X} \subset \mathbb{R}^l$, the goal is to project onto a 
lower-dimensional space, so that inner products are optimally preserved; that is, the cost 

$$E = \sum_{i}\sum_{j} \left( x_i^T x_j - z_i^T z_j \right)^2$$

is minimized, where $z_i$ is the image of $x_i$ and the sum runs over all the training points in $\mathcal{X}$. The 
problem is similar to PCA and it can be shown that the solution is given by the eigendecomposition 
of the Gram matrix, $\mathcal{K} := X^T X$.³ Another side of the same coin is to require the Euclidean distances, 
instead of the inner products, to be optimally preserved. A Gram matrix, consistent with the squared 
Euclidean distances, can then be formed, leading to the same solution as before. It turns out that the 
solutions obtained by PCA and MDS are equivalent. This can readily be understood as $X^T X$ and 
$XX^T$ share the same (nonzero) eigenvalues. The corresponding eigenvectors are different, yet they 
are related, as we have seen while introducing SVD in Section 6.4. 

More on these issues can be found in [29,60]. As we will soon see in Section 19.9, the main 
idea behind MDS of preserving the distances is used, in one way or another, in a number of more 
recently developed nonlinear dimensionality reduction techniques. 

• In a variant of the basic PCA, known as supervised PCA [16,195], the output variables in regression 
or in classification (depending on the problem at hand) are used together with the input ones, in 
order to determine the principal directions. 

FIGURE 19.3 

The case of a two-class task in the two-dimensional space. PCA computes the direction along which the variance 
is maximally retained after the projections of the data on it. In contrast, FLD computes the line so that the class 
separability is maximized. 
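Two of the remarks above lend themselves to a short numerical sketch: obtaining the principal axes from the smaller $N \times N$ matrix $X^T X$ when $l > N$, and the equivalence between PCA scores and the embedding obtained from the eigendecomposition of the Gram matrix. The dimensions, the synthetic data, and the variable names are assumptions. The scaling $u_i = X v_i / \sqrt{\mu_i}$, with $\mu_i$ the corresponding eigenvalue of $X^T X$, is the standard normalization that makes $u_i$ unit norm.

```python
import numpy as np

rng = np.random.default_rng(5)
l, N, m = 500, 40, 3                     # many more features than observations

X = rng.standard_normal((l, N))
X = X - X.mean(axis=1, keepdims=True)    # center along the rows

# Eigendecomposition of the small N x N Gram matrix instead of the
# l x l matrix X X^T.
K = X.T @ X
mu, V = np.linalg.eigh(K)                # ascending order
mu, V = mu[::-1][:m], V[:, ::-1][:, :m]  # keep the m largest

# Principal axes: u_i is proportional to X v_i; dividing by sqrt(mu_i)
# gives unit norm, because ||X v_i||^2 = v_i^T K v_i = mu_i.
U_m = X @ V / np.sqrt(mu)

Z_pca = U_m.T @ X                        # PCA scores, Eq. (19.22)
Z_mds = np.sqrt(mu)[:, None] * V.T       # embedding from the Gram matrix
```

Because both embeddings are derived from the same eigenvectors, they coincide exactly here; in general they agree up to the sign of each axis.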

Example 19.1. This example demonstrates the power of PCA as a method to represent data in a 
lower-dimensional space. Each pattern in a database, described in terms of a feature vector, $x_n \in \mathbb{R}^l$, 
will be represented by a corresponding vector of a reduced dimensionality, $z_n \in \mathbb{R}^m$, $n = 1, 2, \ldots, N$. 

In this example, each feature vector comprises the pixels of a 168 x 168 face image. These face images 
are members of the software-based aligned version [191] of the Labeled Faces in the Wild (LFW) 
database [102]. In particular, among the over 13,000 face images of this database, N = 1924 have been 
selected with criteria such as the quality of the image and the face angle (portraits were preferred). 
Moreover, the images are zoomed in order to omit most of the background. Examples of the face 
images used are depicted in Fig. 19.4 and the full collection of all the 1924 images can be found in the 
companion site of this book. 

3 In order to avoid confusion, recall that here X has been defined as the transpose of what we called a data matrix in previous 
chapters. 

FIGURE 19.4 

Indicative examples of the face images used. 

FIGURE 19.5 

Examples of eigenfaces. 

The images are first vectorized (in $\mathbb{R}^l$, l = 168 x 168 = 28,224) and in the sequel are concatenated 
in the columns of the 28,224 x 1924 matrix X. Moreover, the mean value across each one of the rows 
is computed and then subtracted from the corresponding element of each column. 

In this case, where $l > N$, it is convenient to compute the eigenvectors of $X^T X$, denoted by $v_i$, 
$i = 1, \ldots, N$, and then the principal axis directions, that is, the eigenvectors of $XX^T$, are computed by 
$u_i \propto X v_i$ (Chapter 6, Eq. (6.18)). These eigenvectors can be rearranged in a matrix form to give 168 x 168 
images, known as eigenimages, which in the particular case of face images are referred to as eigenfaces. 
Fig. 19.5 shows examples of eigenfaces resulting from the PCA of matrix X and specifically those 
corresponding, from top left to bottom right, to the 1st, 2nd, 6th, 7th, 8th, 10th, 11th, and 17th largest 
eigenvalues. 

Next, the quality of reconstruction of an original image, in terms of its lower-dimensional 
representation, is examined according to Eq. (19.13) for different values of m. As an example, the images 
depicting Marilyn Monroe and Andy Warhol, shown in Fig. 19.4, are chosen. The results are illustrated 
in Fig. 19.6. It is observed that for m = 100, or even better, for m = 600, the resulting approximation 
is very close to the original images. Note that exact reconstruction will be achieved when the full set 
of the 1924 eigenfaces is used. 

FIGURE 19.6 

Image compression and reconstruction based on the first m eigenvectors. 

To put our previous findings in an information retrieval context, assume that one has available an 
image and wants to know what person is depicted in it. Assuming that the image of this person is in 
the database, the procedure would be (a) to vectorize the image, (b) to project it onto the subspace 
spanned by the, say, m = 100 eigenfaces, and (c) to search in this lower-dimensional space to identify 
the vectorized image in the database that is closest in the Euclidean norm sense. Usually, it is preferable 
to identify the, say, five or ten most similar images and rank them according to the Euclidean distance 
(or any other distance) similarity. Then, through the database, the user can retrieve the name and all the 
associated information that is kept in the database. 

In information retrieval, each one of the images in the database could be stored in terms of the 
corresponding vector of the principal scores. 

Example 19.2. In this example, the use of PCA for image compression is demonstrated. In the previous 
example, PCA was performed across the different images of a database. Here, the focus will be on a 
single image. 

The pixel values of the image are stored in an $l \times N$ matrix X and the columns of this matrix 
are considered to be the observation vectors $x_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$. Note that X needs to be zero 
mean along the rows so the mean vector, $\bar{x}$, is computed and subtracted from each column. Then the 
eigenvectors corresponding to the m, $1 \le m \le l$, largest eigenvalues are obtained either via the sample 
covariance matrix or directly through SVD. Exploiting the matrix factorization formulation of PCA in 
Eq. (19.22) a compressed representation of X, comprising m instead of l rows, is given by 

$$Z^{(m)} = \underbrace{[u_1, \ldots, u_m]^T}_{m \times l}\, X, \tag{19.24}$$

where the dimensionality m has been explicitly brought into the notation. Thus, only $Z^{(m)}$ and 
$u_1, \ldots, u_m$ are needed to get an estimate of the mean-subtracted X via Eq. (19.21). Finally, in order 
to reconstruct the image, the mean vector $\bar{x}$ needs to be added back to each column (see Eq. (19.23)). 

The effectiveness of the PCA-based image compression will be demonstrated with the aid of the 
top-left image depicted in Fig. 19.7. This image is square, having $l = N = 400$. For any m chosen, 
the compression ratio is easily computed considering that instead of 400 x 400 values of the original 
image, after compression the storage of 2 x m x 400 values for the matrix $Z^{(m)}$ and the eigenvectors, 
$u_1, \ldots, u_m$, plus 400 values for the mean vector $\bar{x}$ is needed. This amounts to a compression ratio of 
400 : (2m + 1). The reconstructed images together with the corresponding MSE between the original 
and the reconstructed image for different compression ratios are shown in Fig. 19.7. 

FIGURE 19.7 

PCA-based image compression. The image is from the Greek island Andros. Panels: original image; 
MSE = 0.0168, compression ratio = 36.36 : 1; MSE = 0.0042, compression ratio = 6.55 : 1; 
MSE = 0.0009, compression ratio = 2.48 : 1. 
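The compression and reconstruction steps of this example, together with the storage count behind the compression ratio, can be sketched as follows. The function names are hypothetical, and a random array stands in for an image because the photograph of the example is not available here. The values of m are chosen only for illustration.

```python
import numpy as np

def pca_compress(X, m):
    """Return the m x N scores Z^(m), the l x m axes, and the mean vector."""
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    U_m = U[:, :m]
    Z_m = U_m.T @ (X - mean)               # Eq. (19.24) on the centered data
    return Z_m, U_m, mean

def pca_reconstruct(Z_m, U_m, mean):
    """Estimate X via Eq. (19.21) and add the mean back to each column."""
    return U_m @ Z_m + mean

def compression_ratio(l, N, m):
    """Original l*N values versus m*N (scores) + m*l (axes) + l (mean)."""
    return (l * N) / (m * N + m * l + l)

# Stand-in "image" (random values); a real image would replace this array.
rng = np.random.default_rng(6)
image = rng.random((64, 64))

mse = {}
for m in (4, 16, 64):
    X_hat = pca_reconstruct(*pca_compress(image, m))
    mse[m] = float(np.mean((image - X_hat) ** 2))

# For a 400 x 400 image the ratio reduces to 400 : (2m + 1).
ratio_m5 = compression_ratio(400, 400, 5)
```

For the 400 x 400 case, m = 5 gives 400/11, that is, roughly 36.4 : 1. The MSE decreases as m grows and vanishes (up to rounding) when all axes are kept.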

Remarks 19.2. 

• Subspace tracking: Online subspace tracking is another old area that has recently attracted renewed interest. 
A well-known algorithm of relatively low complexity for tracking the signal subspace is the 
so-called projection approximation subspace tracking (PAST), proposed in [197]. In PAST, the re- 
cursive least-squares (RLS) technique is employed for subspace estimation. Alternative algorithms 
in this line of philosophy have been presented in, for example, [69,115,170,179]. 

More recently, the work in [25,50,137] tackles the problem of subspace tracking with miss- 
ing/unobserved data. The methodology presented in [25] is based on gradient descent iterations 
on the Grassmannian manifold. Furthermore, the algorithms of [50,137] attempt to estimate the 
unknown subspace by minimizing properly constructed loss functions. 

Finally, [51,52,92,132,162] attack the subspace tracking problem in environments where 
observations are contaminated by outlier noise. 


19.4 CANONICAL CORRELATION ANALYSIS 

PCA is a dimensionality reduction technique focusing on a single data set. However, in a number 
of cases, one has to deal with multiple data sets which, although they may originate from different 
sources, are closely related. For example, many problems in medical imaging fall under this umbrella. 
A typical case occurs in the study of brain activity where one can use different modalities, for example, 
electroencephalogram (EEG), functional magnetic resonance imaging (fMRI), or structural MRI. Each 
one of these modalities can grasp a different type of information and it is beneficial to exploit all of 
them in a complementary fashion. Thus, the respective experimental data can appropriately be fused 
in order to get a better description concerning the brain activity that gives birth to the data. Another 
scenario where multiple data sets are of interest is when a single modality is used but different data 
are available measured on different subjects; thus, jointly analyzing the results can be beneficial for the 
finally reached conclusions (see, e.g., [56]). 

CanonicaI correlation analysis (CCA) is an old technique developed in [96] to process two data 
sets jointly. Our starting point is the fact that when two sets of random variables (two random vectors) 
are involved, the value of their correlation does depend on the coordinate system in which the random 
vectors are represented. The goal behind CCA is to seek a pair of linear transformations, one for each 
set of variables, such that after the transformation, the resulting transformed variables are maximally 
correlated. 

Let us assume that we are given two sets of random variables comprising the components of two 
random vectors, x e and yeR ? , and let the corresponding sets of observations be x„. y n , n = 
1, 2,..., N, respectively. Following a step-wise procedure, as we did for PCA, we will hrst compute a 
single pair of directions, namely, u x j, u y \, so that the correlation between the projections onto these 
directions is maximized. Let z A - i := u J x { x and z v i := u J x { y be the (zero mean) random variables after 
the linear transformation (projection). Note that these variables are the counterparts of what we called 
principal components in PCA. The corresponding correlation coefficient (normalized covariance) is 
defined as 


or 


where 


E[z, u z v ,i] E[u<J 1 x)(y 7 Mv , 1 )] 

y E [ z ii] E [z 2 v ,i] 7 E [(«J,i x ) 2 ] E [(«y, 1 y)2 ] 


P ■= 


H v 1 E X ylly\ 


\j (“.V. 1 ^XX^X,1 [ EyyUy, [ ) 


E 


x 

y 



EXX 

E X y 

2yx 

Eyy _ 


(19.25) 


(19.26) 


Note that, by the respective definition, we have $\Sigma_{xy} = \Sigma_{yx}^T$. When expectations are not available,
covariances are replaced by the corresponding sample covariance values. This is the most common case
in practice, so we will adhere to it and use the notation with the “hat.” Furthermore, it can easily be
checked that the correlation coefficient is invariant to scaling (changing, e.g., $\mathbf{x} \to b\mathbf{x}$). Thus,
maximizing it with respect to the directions $\mathbf{u}_{x,1}$ and $\mathbf{u}_{y,1}$ can equivalently be cast as the
following constrained optimization task:

$$\max_{\mathbf{u}_x, \mathbf{u}_y} \ \mathbf{u}_x^T \hat{\Sigma}_{xy} \mathbf{u}_y, \tag{19.27}$$

$$\text{s.t.} \quad \mathbf{u}_x^T \hat{\Sigma}_{xx} \mathbf{u}_x = 1, \tag{19.28}$$

$$\mathbf{u}_y^T \hat{\Sigma}_{yy} \mathbf{u}_y = 1. \tag{19.29}$$

Compare Eqs. (19.27)-(19.29) with the optimization task defining PCA in Eqs. (19.2) and (19.3). For
CCA, two directions have to be computed and the constraints involve the weighted $\Sigma$ norm instead of
the Euclidean one. Moreover, in PCA the variance is maximized, while CCA cares for the correlation
between the projections of the two involved vectors onto the new axes.

Employing Lagrange multipliers, the corresponding Lagrangian of Eqs. (19.27)-(19.29) is given
by

$$L(\mathbf{u}_x, \mathbf{u}_y, \lambda_x, \lambda_y) = \mathbf{u}_x^T \hat{\Sigma}_{xy} \mathbf{u}_y - \frac{\lambda_x}{2}\left(\mathbf{u}_x^T \hat{\Sigma}_{xx} \mathbf{u}_x - 1\right) - \frac{\lambda_y}{2}\left(\mathbf{u}_y^T \hat{\Sigma}_{yy} \mathbf{u}_y - 1\right).$$

Taking the gradients with respect to $\mathbf{u}_x$ and $\mathbf{u}_y$ and equating to zero, we obtain (Problem 19.2)

$$\lambda_x = \lambda_y := \lambda$$

and

$$\hat{\Sigma}_{xy} \mathbf{u}_y = \lambda \hat{\Sigma}_{xx} \mathbf{u}_x, \tag{19.30}$$

$$\hat{\Sigma}_{yx} \mathbf{u}_x = \lambda \hat{\Sigma}_{yy} \mathbf{u}_y. \tag{19.31}$$


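Solving the latter of the two with respect to $\mathbf{u}_y$ and substituting into the first one, we finally get

$$\hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{yx} \mathbf{u}_x = \lambda^2 \hat{\Sigma}_{xx} \mathbf{u}_x, \tag{19.32}$$

and

$$\mathbf{u}_y = \frac{1}{\lambda} \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{yx} \mathbf{u}_x, \tag{19.33}$$

assuming, of course, invertibility of $\hat{\Sigma}_{yy}$. Furthermore, assuming invertibility of $\hat{\Sigma}_{xx}$, too, we end up
with the following eigenvalue-eigenvector problem:

$$\left(\hat{\Sigma}_{xx}^{-1} \hat{\Sigma}_{xy} \hat{\Sigma}_{yy}^{-1} \hat{\Sigma}_{yx}\right) \mathbf{u}_x = \lambda^2 \mathbf{u}_x. \tag{19.34}$$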


Thus, the axis $\mathbf{u}_{x,1}$ is obtained as an eigenvector of the product of matrices in the parentheses in
Eq. (19.34). Taking into account Eq. (19.30) and the constraints, it turns out that the corresponding
optimal value of the correlation, $\rho$, is equal to

$$\rho = \mathbf{u}_{x,1}^T \hat{\Sigma}_{xy} \mathbf{u}_{y,1} = \lambda\, \mathbf{u}_{x,1}^T \hat{\Sigma}_{xx} \mathbf{u}_{x,1} = \lambda.$$

Hence, selecting $\mathbf{u}_{x,1}$ to be the eigenvector corresponding to the maximum eigenvalue, $\lambda^2$, results in
maximum correlation.

The eigenvectors $\mathbf{u}_{x,1}$, $\mathbf{u}_{y,1}$ are known as the normalized canonical correlation basis vectors, the
eigenvalue $\lambda^2$ as the squared canonical correlation, and the projections $z_{x,1}$, $z_{y,1}$ as the canonical
variates.
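The computation of the first canonical pair can be sketched numerically as follows. This is only a minimal illustration, assuming NumPy and a hypothetical synthetic data set in which a shared latent signal `t` drives one coordinate of each view; the variable names are not taken from the text. It forms the sample covariances, solves the eigenproblem in Eq. (19.34), recovers $\mathbf{u}_{y,1}$ via Eq. (19.33), and compares $\lambda$ with the empirical correlation of the two canonical variates.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Hypothetical data: a shared latent signal t links the two views.
t = rng.standard_normal(N)
X = np.column_stack([t + 0.5 * rng.standard_normal(N),
                     rng.standard_normal(N),
                     rng.standard_normal(N)])             # N x p, p = 3
Y = np.column_stack([rng.standard_normal(N),
                     -t + 0.5 * rng.standard_normal(N)])  # N x q, q = 2

# Center the observations and form the sample ("hat") covariances.
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)
Sxx = X.T @ X / N
Syy = Y.T @ Y / N
Sxy = X.T @ Y / N
Syx = Sxy.T

# Eigenproblem of Eq. (19.34): (Sxx^{-1} Sxy Syy^{-1} Syx) u_x = lambda^2 u_x.
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Syx)
eigvals, eigvecs = np.linalg.eig(M)
k = np.argmax(eigvals.real)
lam = np.sqrt(eigvals.real[k])

# Normalize so that u_x^T Sxx u_x = 1, then obtain u_y from Eq. (19.33).
ux = eigvecs[:, k].real
ux = ux / np.sqrt(ux @ Sxx @ ux)
uy = np.linalg.solve(Syy, Syx @ ux) / lam

# Canonical variates and their empirical correlation coefficient.
zx, zy = X @ ux, Y @ uy
rho = np.corrcoef(zx, zy)[0, 1]
print(lam, rho)
```

Under these assumptions the printed values agree up to rounding, in line with $\rho = \lambda$, and the second constraint $\mathbf{u}_y^T \hat{\Sigma}_{yy} \mathbf{u}_y = 1$ is satisfied automatically by the vector obtained from Eq. (19.33).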

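The previous idea can now be taken further, and we can compute a pair of subspaces,
span$\{\mathbf{u}_{x,1}, \ldots, \mathbf{u}_{x,m}\}$ and span$\{\mathbf{u}_{y,1}, \ldots, \mathbf{u}_{y,m}\}$, where $m < \min(p, q)$. One way to
achieve this goal is in a step-wise fashion, as was done for PCA. Assuming that $k$ pairs of basis vectors
have already been computed, the $(k+1)$th pair is obtained by solving the following constrained
optimization task:

$$\max_{\mathbf{u}_{x,k+1}, \mathbf{u}_{y,k+1}} \ \mathbf{u}_{x,k+1}^T \hat{\Sigma}_{xy} \mathbf{u}_{y,k+1}, \tag{19.35}$$

$$\text{s.t.} \quad \mathbf{u}_{x,k+1}^T \hat{\Sigma}_{xx} \mathbf{u}_{x,k+1} = 1, \quad \mathbf{u}_{y,k+1}^T \hat{\Sigma}_{yy} \mathbf{u}_{y,k+1} = 1, \tag{19.36}$$

$$\mathbf{u}_{x,k+1}^T \hat{\Sigma}_{xx} \mathbf{u}_{x,i} = 0, \quad i = 1, 2, \ldots, k, \tag{19.37}$$

$$\mathbf{u}_{y,k+1}^T \hat{\Sigma}_{yy} \mathbf{u}_{y,i} = 0, \quad i = 1, 2, \ldots, k. \tag{19.38}$$

In other words, every new pair of vectors is computed so as to be normalized (Eq. (19.36)) and at
the same time, each one is orthogonal (in the generalized sense) to those obtained in the previous
iteration steps (Eqs. (19.37) and (19.38)). Note that this guarantees that the derived canonical variates
are uncorrelated to all previously derived ones. This reminds us of the uncorrelatedness property of
the principal components in PCA. The only nonzero correlation in CCA, which is maximized at every
iteration step, is the one between $z_{x,k} = \mathbf{u}_{x,k}^T \mathbf{x}$ and $z_{y,k} = \mathbf{u}_{y,k}^T \mathbf{y}$, $k = 1, 2, \ldots, m$.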

More on CCA can be found in [6,26]. Extensions of CCA in reproducing kernel Hilbert spaces 
have also been developed and used (see, for example, [9,89,117] and the references therein). In [89], 
the kernel CCA is used for content-based image retrieval. The aim is to allow retrieval of images 
from a text query but without reference to any labeling associated with the image. The task is treated 
as a cross-modal problem. A probabilistic Bayesian formulation of CCA has been given in [15,113]. 
A regularized CCA version, using sparsity-based arguments, has been derived in [90]. In [65], a variant 
of CCA is proposed, named correlated component analysis ; instead of two directions (subspaces), 
a common direction is derived for both data sets. The idea behind this method is that the two data 
sets may not be much different, so a single direction is enough. In this way, the task has fewer free 
parameters to estimate. Moreover, the constraint on orthogonality is dropped, which in some cases 
may not be physically justifiable. A Bayesian extension of the method is provided in [147]. 

Example 19.3. Let $\mathbf{x} \in \mathbb{R}^2$ be a normally distributed random vector, $\mathcal{N}(\mathbf{0}, I)$. The pair of random
variables $(y_1, y_2)$ is related to $(x_1, x_2)$ as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$

Note the strong correlation that exists between the involved variables, because

$$y_1 + y_2 = x_1 + x_2.$$

However, the cross-covariance matrix $\Sigma_{yx}$,

$$\Sigma_{yx} = \begin{bmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{bmatrix},$$

indicates a rather low correlation. After performing CCA, the resulting directions are

$$\mathbf{u}_{x,1} = \mathbf{u}_{y,1} = \frac{1}{\sqrt{2}} [1, 1]^T,$$

which actually is the direction where the linear equality of the involved variables lies. The maximum
correlation coefficient value is equal to 1, indicating strong correlation indeed.


19.4.1 RELATIVES OF CCA 

CCA is not the only multivariate technique to process and deal with different data sets jointly. Vari- 
ous techniques have been developed, using different optimizing criteria/constraints, each one serving 
different needs and goals. 

The aim of this subsection is to briefly discuss some of these methods under a common framework.
Recall that the eigenvalue-eigenvector problem for computing the pair of canonical basis vectors results
from the pair of equations in Eqs. (19.30) and (19.31). These can be combined into a single one [26],
namely,

$$C\mathbf{u} = \lambda B\mathbf{u}, \tag{19.39}$$

where

$$\mathbf{u} := [\mathbf{u}_x^T, \mathbf{u}_y^T]^T$$

and

$$C := \begin{bmatrix} O & \hat{\Sigma}_{xy} \\ \hat{\Sigma}_{yx} & O \end{bmatrix}, \quad B := \begin{bmatrix} \hat{\Sigma}_{xx} & O \\ O & \hat{\Sigma}_{yy} \end{bmatrix}.$$

By changing the structure of the two matrices, $C$ and $B$, different methods result. For example, if we set
$C = \hat{\Sigma}_{xx}$ and $B = I$, we get the eigenvalue-eigenvector task of PCA.

In [189], algorithmic procedures for the solution of the related equations, in a numerically robust 
way, are discussed. 
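The unified formulation in Eq. (19.39) can be checked numerically. The sketch below is only an illustration, assuming NumPy and hypothetical random data; it solves the generalized problem by forming $B^{-1}C$ explicitly, which is adequate for a small example but is not one of the numerically robust procedures referred to above. It builds $C$ and $B$ for CCA, verifies that the largest generalized eigenvalue is the square root of the largest eigenvalue of Eq. (19.34), and confirms that the choice $C = \hat{\Sigma}_{xx}$, $B = I$ reproduces the PCA eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 4000, 3, 2

# Hypothetical data: y depends linearly on the first q coordinates of x, plus noise.
X = rng.standard_normal((N, p))
Y = X[:, :q] @ np.array([[1.0, 0.3], [0.2, 1.0]]) + rng.standard_normal((N, q))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
Sxx, Syy, Sxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N

# Block matrices of Eq. (19.39) for CCA.
C = np.block([[np.zeros((p, p)), Sxy],
              [Sxy.T, np.zeros((q, q))]])
B = np.block([[Sxx, np.zeros((p, q))],
              [np.zeros((q, p)), Syy]])

# C u = lambda B u  is equivalent to  (B^{-1} C) u = lambda u.
lam_gen = np.linalg.eigvals(np.linalg.solve(B, C)).real.max()

# Largest eigenvalue (lambda^2) of the matrix in Eq. (19.34).
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
lam_sq = np.linalg.eigvals(M).real.max()

# PCA as a special case: C = Sxx, B = I.
pca_vals = np.sort(np.linalg.eigvals(np.linalg.solve(np.eye(p), Sxx)).real)
pca_ref = np.linalg.eigvalsh(Sxx)  # ascending eigenvalues of the symmetric Sxx

print(lam_gen**2, lam_sq)
```

The generalized eigenvalues come in pairs $\pm\lambda$ (together with zeros when $p \neq q$), which is why only the largest one is compared here.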


Partial Least-Squares

The partial least-squares (PLS) method was first introduced in [186], and it has been used extensively
in a number of applications, such as chemometrics, bioinformatics, food research, medicine,
pharmacology, social sciences, and physiology, to name but a few. The corresponding eigenanalysis problem
results if we set in Eq. (19.39)

$$B = \begin{bmatrix} I & O \\ O & I \end{bmatrix}$$

and keep $C$ the same as for CCA. This eigenvalue-eigenvector problem arises (try it) if instead of
maximizing the correlation coefficient $\rho$ in Eq. (19.25), one maximizes the covariance, that is,

$$\mathrm{cov}(z_{x,1}, z_{y,1}) = \mathrm{E}[z_{x,1} z_{y,1}]. \tag{19.40}$$

This means that while trying to reduce the dimensionality, our concern is not only the correlation;
at the same time we want to identify directions that also care for maximum variance for both sets
of variables. The optimizing task for identifying the first pair of axes, $\mathbf{u}_{x,1}$, $\mathbf{u}_{y,1}$, now becomes

$$\text{maximize} \quad \mathbf{u}_x^T \hat{\Sigma}_{xy} \mathbf{u}_y, \tag{19.41}$$

$$\text{s.t.} \quad \mathbf{u}_x^T \mathbf{u}_x = 1, \tag{19.42}$$

$$\mathbf{u}_y^T \mathbf{u}_y = 1. \tag{19.43}$$

PLS has been used both for classification and for regression tasks. For example, in Chapter 6, we used
PCA for regression in order to reduce the dimensionality of the space and the least-squares solution
was expressed in this lower-dimensional space. However, the principal axes were determined only
on the basis of the input data so as to retain maximum variance. In contrast, PLS can be employed
by considering the output observations as the second set of variables, and one can select the axes
so as to maximize the variances as well as the correlation between the two data sets. The latter can be
understood from the fact that maximizing the covariance (PLS) is equivalent to maximizing the product
of the correlation coefficient (used for CCA) times the square roots of the two variance terms.

The literature on PLS is extensive and the method has been studied both algorithmically and from
its performance point of view. The interested reader can obtain more on PLS from [153]. In all the
techniques we have discussed so far, a major focus is on computing the eigenvalues-eigenvectors. To
this end, although one can use general packages and algorithms, a number of more efficient alternatives
have been derived. A common approach is to solve the task in a two-step iterative procedure. In the
first step, the largest eigenvalue (eigenvector) is computed, for which there exist efficient algorithms,
such as the power method (e.g., [83]). Then, a procedure known as deflation is adopted; this consists of
removing from the covariance matrices the variance that has been explained with the features extracted
from the first step (see, for example, [135]). Kernelized versions of PLS have also been proposed (for
example, [9,152]).

Remarks 19.3.

• Another dimensionality reduction method results if we set in Eq. (19.39)

$$B = \begin{bmatrix} \hat{\Sigma}_{xx} & O \\ O & I \end{bmatrix}.$$

The resulting method is known as multivariate linear regression (MLR). This is the task of finding
a set of basis vectors and corresponding regressors such that the MSE in a regression problem is
minimized [26].

• CCA is invariant with respect to affine transformations. This is an important advantage with respect 
to the ordinary correlation analysis (for example, [6]). 

• Extensions of CCA and PLS to more than two data sets have also been proposed (see, for example,
[56,110,185]).
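The two-step procedure mentioned in the discussion of PLS, namely the power method followed by deflation, can be sketched for a generic symmetric positive semidefinite matrix. The code below is a minimal illustration, assuming NumPy; the function names `power_method` and `top_eigenpairs`, the fixed iteration count, and the test matrix are choices made here for illustration and are not taken from [83] or [135]. The deflation used is the simple subtraction of the rank-one term $\lambda \mathbf{u}\mathbf{u}^T$.

```python
import numpy as np

def power_method(S, n_iter=500, seed=0):
    """Dominant eigenpair of a symmetric positive semidefinite matrix S."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(S.shape[0])
    for _ in range(n_iter):
        u = S @ u                 # amplify the dominant direction
        u /= np.linalg.norm(u)    # keep unit norm
    return u @ S @ u, u           # Rayleigh quotient and eigenvector estimate

def top_eigenpairs(S, m):
    """First m eigenpairs via repeated power iterations with deflation."""
    S = S.copy()
    pairs = []
    for _ in range(m):
        lam, u = power_method(S)
        pairs.append((lam, u))
        S = S - lam * np.outer(u, u)  # remove the variance already explained
    return pairs

# Hypothetical sample covariance with well-separated eigenvalues.
rng = np.random.default_rng(2)
data = rng.standard_normal((500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
S = data.T @ data / data.shape[0]

pairs = top_eigenpairs(S, 2)
reference = np.linalg.eigvalsh(S)[::-1]  # eigenvalues in descending order
print([lam for lam, _ in pairs], reference[:2])
```

The convergence rate of the power iterations depends on the ratio of the two largest eigenvalues, so a matrix with nearly equal leading eigenvalues would need more iterations than assumed here.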
19.5 INDEPENDENT COMPONENT ANALYSIS

The latent variable interpretation of PCA was summarized in Eqs. (19.17)-(19.19), where each one
of the observed random variables, $x_i$, is (approximately) written as a linear combination of the latent
variables (principal components in this case), $z_i$, which are in turn obtained via Eq. (19.19), imposing
the uncorrelatedness constraint.

The kick-off point for ICA is to assume that the following latent model is true:

$$\mathbf{x} = A\mathbf{s}, \tag{19.44}$$

where the (unknown) latent variables of $\mathbf{s}$ are assumed to be mutually statistically independent and
we refer to them as the independent components (ICs). The task then comprises obtaining estimates
of both the matrix $A$ and the independent components. We will focus on the case where $A$ is an $l \times l$
square matrix. Extensions to fat and tall matrices, corresponding to scenarios where the number of
latent variables, $m$, is smaller or larger than the number of the observed random variables, $l$, have also
been considered and developed (see, e.g., [100]).

Matrix $A$ is known as the mixing matrix and its elements, $a_{ij}$, as the mixing coefficients. The
resulting estimates of the latent variables will be denoted as $z_i$, $i = 1, 2, \ldots, l$, and we will also refer to them
as independent components. The observed random variables, $x_i$, $i = 1, 2, \ldots, l$, are sometimes called
the mixture variables or simply mixtures.

To obtain the estimates of the latent variables, we adopt the model

$$\hat{\mathbf{s}} := \mathbf{z} = W\mathbf{x}, \tag{19.45}$$

where $W$ is also known as the unmixing or separating matrix. Note that

$$\mathbf{z} = WA\mathbf{s},$$

and we have to estimate the unknown parameters, so that $\mathbf{z}$ is as close to $\mathbf{s}$ as possible. For square
matrices, $A = W^{-1}$, assuming invertibility.


19.5.1 ICA AND GAUSSIANITY 

Although in general in statistics adopting the Gaussian assumption for a PDF seems to be rather a
“blessing,” in the case of ICA this is not true anymore. This can easily be understood if we look at the
consequences of adopting the Gaussian assumption. If the independent components follow Gaussian
distributions, their joint PDF is given by

$$p(\mathbf{s}) = \frac{1}{(2\pi)^{l/2}} \exp\left(-\frac{\|\mathbf{s}\|^2}{2}\right), \tag{19.46}$$

where for simplicity we have assumed that all the variables are normalized to unit variance. Let the
mixing matrix, $A$, be an orthogonal one, that is, $A^{-1} = A^T$. Then the joint PDF of the mixtures is
readily obtained as (see Eq. (2.45))

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{l/2}} \exp\left(-\frac{\|A^T\mathbf{x}\|^2}{2}\right) |\det(A^T)|. \tag{19.47}$$

However, due to the orthogonality of $A$, we have $\|A^T\mathbf{x}\|^2 = \|\mathbf{x}\|^2$ and $|\det(A^T)| = 1$, which makes
$p(\mathbf{s})$ indistinguishable from $p(\mathbf{x})$. That is, no conclusion about $A$ can be drawn by observing $\mathbf{x}$, as all
related information has been lost. Seen from another point of view, the mixtures $x_i$ are mutually
uncorrelated, as $\Sigma_x = I$, and ICA can provide no further information. This is a direct consequence of the fact
that uncorrelatedness for jointly Gaussian variables is equivalent to independence (see Section 2.3.2).
In other words, if the latent variables are Gaussians, ICA cannot take us any further than PCA,
because the latter provides uncorrelated components. That is, the mixing matrix, $A$, is not identifiable for
Gaussian independent components. In a more general setting, in a case where some of the components
are Gaussians and some are not, ICA can identify the non-Gaussian ones. Thus, for a matrix $A$ to be
identifiable, at most one of the independent components can be Gaussian.

From a mathematical point of view, the ICA task is ill-posed for Gaussian variables. Indeed, assume 
that a set of independent Gaussian components, z, have been obtained; then any linear transformation 
on z by a unitary matrix will also be a solution (as shown previously). Note that this problem is by- 
passed in PCA, because the latter imposes a specific structure on the transformation matrix. 

In order to deal with independence one has to involve, in one way or another, higher-order statistical 
information. Second-order statistical information suffices for imposing uncorrelatedness, as is the case 
with PCA, but it is not enough for ICA. To this end, a large number of techniques and algorithms 
have been developed over the years and reviewing all these techniques is far beyond the limits imposed 
on a book section. The goal here is to provide the reader with the essence behind these techniques 
and emphasize the need to bring higher-order statistics into the game. The interested reader can delve 
deeper in this field from [55,58,81,100,120]. 
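The role of higher-order statistics can be illustrated with a small simulation. The sketch below is only an illustration, assuming NumPy; the helper `dependence` and the choice of a 45-degree rotation as the orthogonal mixing matrix are assumptions made here. It mixes two independent unit-variance sources, once Gaussian and once uniform, and compares a second-order statistic (the covariance matrix) with the fourth-order quantity $\mathrm{E}[x_1^2 x_2^2] - \mathrm{E}[x_1^2]\mathrm{E}[x_2^2]$, which is zero for independent variables.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
c = 1 / np.sqrt(2)
A = np.array([[c, -c],
              [c,  c]])  # orthogonal mixing matrix: A^{-1} = A^T

def dependence(x):
    """Fourth-order statistic E[x1^2 x2^2] - E[x1^2] E[x2^2]; zero under independence."""
    return np.mean(x[0]**2 * x[1]**2) - np.mean(x[0]**2) * np.mean(x[1]**2)

# Independent, zero-mean, unit-variance sources.
s_gauss = rng.standard_normal((2, N))
s_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, N))

x_gauss = A @ s_gauss
x_unif = A @ s_unif

# Second-order statistics cannot tell the two cases apart: both are close to I.
cov_gauss = np.cov(x_gauss)
cov_unif = np.cov(x_unif)

# Fourth-order statistics can: about 0 for the Gaussian mixtures,
# about -0.6 in expectation for the uniform ones.
dep_gauss = dependence(x_gauss)
dep_unif = dependence(x_unif)
print(dep_gauss, dep_unif)
```

In both cases the mixtures are uncorrelated, but only the mixtures of uniform sources reveal, through the fourth-order statistic, that they are not independent; this is the information that ICA exploits and that is absent in the Gaussian case.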

19.5.2 ICA AND HIGHER-ORDER CUMULANTS 

Imposing the constraint on the components of $\mathbf{z}$ to be independent is equivalent to demanding all
higher-order cross-cumulants (Appendix B.3) to be zero. One possibility to achieve this is to restrict
ourselves up to the fourth-order cumulants [57]. As stated in Appendix B.3, the first three cumulants
for zero mean variables are equal to the corresponding moments, that is,

$$\kappa_1(z_i) = \mathrm{E}[z_i] = 0,$$

$$\kappa_2(z_i, z_j) = \mathrm{E}[z_i z_j],$$

$$\kappa_3(z_i, z_j, z_k) = \mathrm{E}[z_i z_j z_k],$$

and the fourth-order cumulants are given by

$$\kappa_4(z_i, z_j, z_k, z_r) = \mathrm{E}[z_i z_j z_k z_r] - \mathrm{E}[z_i z_j]\,\mathrm{E}[z_k z_r] - \mathrm{E}[z_i z_k]\,\mathrm{E}[z_j z_r] - \mathrm{E}[z_i z_r]\,\mathrm{E}[z_j z_k].$$

An assumption that is employed is that the involved PDFs are symmetric, which renders odd-order
cumulants zero. Thus, we are left only with the second- and fourth-order cumulants. Under the
previous assumptions, our goal is to estimate the unmixing matrix, $W$, so that (a) the second-order and
(b) the fourth-order cross-cumulants become zero. This is achieved in two steps.

Step 1: Compute

$$\hat{\mathbf{z}} = U^T \mathbf{x}, \tag{19.48}$$

where $U$ is the unitary $l \times l$ matrix associated with PCA. This transformation guarantees that the
components of $\hat{\mathbf{z}}$ are uncorrelated, that is,

$$\mathrm{E}[\hat{z}_i \hat{z}_j] = 0, \quad i \neq j, \ i, j = 1, 2, \ldots, l.$$

Step 2: Compute an orthogonal matrix, $\hat{U}$, such that the fourth-order cross-cumulants of the
components of the transformed random vector

$$\mathbf{z} = \hat{U}^T \hat{\mathbf{z}} \tag{19.49}$$

are zero. In order to achieve this, the following maximization task is solved:

$$\max_{\hat{U}\hat{U}^T = I} \ \sum_{i=1}^{l} \kappa_4^2(z_i). \tag{19.50}$$

Step 2 is justified as follows. It can be shown [57] that the sum of the squares of the fourth-order
cumulants is invariant under a linear transformation by an orthogonal matrix. Therefore, as the sum
of the squares of the fourth-order cumulants is fixed for $\hat{\mathbf{z}}$, maximizing the sum of the squares of the
autocumulants of $\mathbf{z}$ will force the corresponding cross-cumulants to zero. Observe that this is basically
a diagonalization problem of the fourth-order cumulant multidimensional array. In practice, this can be
achieved by generalizing the method of Givens rotations, used for matrix diagonalization [57]. Note
that the sum that is maximized is a function of (a) the elements of the unknown matrix $\hat{U}$, (b) the
elements of the known (for this step) matrix $U$, and (c) the cumulants of the random components
of the mixtures $\mathbf{x}$, which have to be estimated prior to the application of the method. In practice,
it usually turns out that setting the cross-cumulants to zero is only approximately achieved. This is
because the model in Eq. (19.44) may not be exact, for example, due to the existence of noise. Also, the
cumulants of the mixtures are only approximately known, because they are estimated from the available
observations.

Once $U$ and $\hat{U}$ have been computed, the unmixing matrix is readily available and we can write

$$\mathbf{z} = W\mathbf{x} = (U\hat{U})^T \mathbf{x},$$

and the mixing matrix is given as $A = W^{-1}$.

A number of algorithms have been developed around the idea of higher-order cumulants, which 
are also known as tensorial methods. Tensors are generalizations of matrices and cumulant tensors are 
generalizations of the covariance matrix. Moreover, note that as the eigenanalysis of the covariance 
matrix leads to uncorrelated (principal) components, the eigenanalysis of the cumulant tensor leads to 
independent components. The interested reader can obtain a more detailed account of such techniques 
from [39,57,119]. 
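For two sources, the two-step procedure can be sketched in a few lines. The code below is a simplified illustration, assuming NumPy, and it departs from the description above in two labeled ways: Step 1 is implemented as PCA followed by scaling to unit variance (whitening), so that the remaining transformation is a pure rotation, and Step 2 replaces the generalized Givens rotations of [57] with a plain grid search over a single rotation angle. The mixing matrix and the uniform sources are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000

# Hypothetical independent, unit-variance, sub-Gaussian sources and mixing matrix.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, N))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)

# Step 1 (assumption: PCA plus scaling, i.e., whitening).
eigval, U = np.linalg.eigh(X @ X.T / N)
V = (U / np.sqrt(eigval)).T          # rows: scaled principal directions
Z_hat = V @ X                        # uncorrelated, unit-variance components

def kurt(z):
    """Sample fourth-order autocumulant of each row of z (zero-mean rows)."""
    return np.mean(z**4, axis=1) - 3 * np.mean(z**2, axis=1)**2

def rot(a):
    """2 x 2 rotation matrix by angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

# Step 2 (assumption: grid search instead of Givens-type updates):
# maximize the sum of squared autocumulants, as in Eq. (19.50).
angles = np.linspace(0.0, np.pi / 2, 901)
scores = [np.sum(kurt(rot(a).T @ Z_hat)**2) for a in angles]
U_hat = rot(angles[int(np.argmax(scores))])

W = U_hat.T @ V                      # overall unmixing matrix
Z = W @ X                            # recovered components
P = W @ A                            # ideally a scaled permutation matrix
print(np.round(P, 2))
```

Restricting the angle to $[0, \pi/2]$ loses nothing, because rotations outside this range only permute the components or flip their signs.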

ICA Ambiguities 

Any ICA method can (approximately) recover the independent components within the following two
indeterminacies.

• Independent components (ICs) are recovered to within a constant factor. Indeed, if $A$ and $\mathbf{z}$ are the
quantities recovered by an ICA algorithm, then $(1/a)A$ and $a\mathbf{z}$ is also a solution, as is readily seen
from Eq. (19.44). Thus, usually the recovered latent variables (ICs) are normalized to unit variance.

• We cannot determine the order of the ICs. Indeed, if $A$ and $\mathbf{z}$ have been recovered and $P$ is a
permutation matrix, then $AP^{-1}$ and $P\mathbf{z}$ is also a solution, because the components of $P\mathbf{z}$ are the
same as those of $\mathbf{z}$ in a different order (with the same statistical properties).

FIGURE 19.8

A Gaussian (full gray line), a super-Gaussian (dotted red line), and a sub-Gaussian (full red line). Horizontal axis: $x$; vertical axis: $p(x)$.
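The two indeterminacies listed above can be verified directly. The short check below is a sketch assuming NumPy, with an arbitrary hypothetical mixing matrix, scale factor, and permutation; it confirms that the rescaled pair and the permuted pair reproduce exactly the same mixtures as the original one.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # hypothetical recovered mixing matrix
z = rng.standard_normal((3, 5))   # hypothetical recovered components (5 observations)
x = A @ z                         # the observed mixtures, Eq. (19.44)

# Scaling ambiguity: (1/a) A together with a z gives the same mixtures.
a = 2.5
x_scaled = (A / a) @ (a * z)

# Ordering ambiguity: A P^{-1} together with P z gives the same mixtures.
P = np.eye(3)[[2, 0, 1]]          # a permutation matrix
x_perm = (A @ np.linalg.inv(P)) @ (P @ z)

print(np.allclose(x, x_scaled), np.allclose(x, x_perm))
```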


19.5.3 NON-GAUSSIANITY AND INDEPENDENT COMPONENTS 

The fourth-order (auto)cumulant of a zero mean random variable, $z$,

$$\kappa_4(z) = \mathrm{E}[z^4] - 3\left(\mathrm{E}[z^2]\right)^2,$$

is known as the kurtosis of the variable and it is a measure of non-Gaussianity. Variables following
the Gaussian distribution have zero kurtosis. Sub-Gaussian variables (variables whose PDF is flatter
around the mean and has lighter tails than the Gaussian of the same variance) have negative kurtosis.
Super-Gaussian variables (corresponding to PDFs that are more peaked around the mean and have heavier
tails than the Gaussian) have positive kurtosis. Thus, if we keep the variance fixed (e.g., for variables
normalized to unit variance), maximizing the sum of squared kurtosis values results in maximizing the
non-Gaussianity of the recovered ICs. Usually, the absolute value of the kurtosis of the recovered ICs
is used as a measure for ranking them. This is important if ICA is
used as a feature generation technique. Fig. 19.8 shows some typical examples of a sub-Gaussian and
a super-Gaussian together with the corresponding Gaussian distribution. Another typical example
of a sub-Gaussian distribution is the uniform one.

Recall from Chapter 12 (Section 12.8.1) that the Gaussian distribution is the one that maximizes the
entropy under the variance and mean constraints. In other words, it is the most random one, under these
constraints, and from this point of view the least informative with respect to the underlying structure of
the data. In contrast, distributions that have the least resemblance to the Gaussian are more interesting
as they are able to better unveil the structure associated with the data. This observation is at the heart
of projection pursuit, which is closely related to the ICA family of techniques. The essence of these
techniques is to search for directions in the feature space where the data projections are described in
terms of non-Gaussian distributions [97,106].
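The sign of the kurtosis for the three families of distributions can be checked empirically. The sketch below assumes NumPy; the function name `kurtosis` and the particular distributions (uniform as the sub-Gaussian example, Laplacian as the super-Gaussian one, all scaled to unit variance) are choices made here for illustration.

```python
import numpy as np

def kurtosis(z):
    """Sample version of kappa_4(z) = E[z^4] - 3 (E[z^2])^2 after mean removal."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()
    return np.mean(z**4) - 3 * np.mean(z**2)**2

rng = np.random.default_rng(0)
N = 500_000

# All three distributions are scaled to unit variance.
samples = {
    "uniform (sub-Gaussian)": rng.uniform(-np.sqrt(3), np.sqrt(3), N),
    "Gaussian": rng.standard_normal(N),
    "Laplacian (super-Gaussian)": rng.laplace(0.0, 1 / np.sqrt(2), N),
}
kurt = {name: kurtosis(z) for name, z in samples.items()}

for name, value in kurt.items():
    print(f"{name}: {value:+.3f}")
```

For unit variance, the theoretical values are $-1.2$ for the uniform, $0$ for the Gaussian, and $3$ for the Laplacian distribution; the sample estimates fluctuate around them, with the heavy-tailed Laplacian estimate being the noisiest.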

19.5.4 ICA BASED ON MUTUAL INFORMATION

The approach based on zeroing the second- and fourth-order cross-cumulants is not the only one. An
alternative path is to estimate $W$ by minimizing the mutual information among the latent variables.
The notion of mutual information was introduced in Section 2.5. Elaborating a bit on Eq. (2.158) and
performing the integrations on the right-hand side (for the case of more than two variables), it is readily
shown that

$$I(\mathbf{z}) = -H(\mathbf{z}) + \sum_{i=1}^{l} H(z_i), \tag{19.51}$$

where $H(z_i)$ is the associated entropy of $z_i$, defined in Eq. (2.157). In Section 2.5 it has been shown
that $I(\mathbf{z})$ is equal to the Kullback-Leibler (KL) divergence between the joint PDF $p(\mathbf{z})$ and the product
of the respective marginal probability densities, $\prod_{i=1}^{l} p_i(z_i)$. The KL divergence (and, hence, the
associated mutual information $I(\mathbf{z})$) is a nonnegative quantity and it becomes zero if the components $z_i$ are
statistically independent. This is because only in this case the joint PDF becomes equal to the product
of the corresponding marginal PDFs, leading the KL divergence to zero. Hence, the idea now becomes
to compute $W$ so as to force $I(\mathbf{z})$ to be minimum, as this will make the components of $\mathbf{z}$ as independent
as possible. Plugging Eq. (19.45) into Eq. (19.51) and taking into account the formula that relates the
two PDFs associated with $\mathbf{x}$ and $\mathbf{z}$ (Eq. (2.45)), we end up with

$$I(\mathbf{z}) = -H(\mathbf{x}) - \ln|\det(W)| - \sum_{i=1}^{l} \int p_i(z_i) \ln p_i(z_i)\, dz_i. \tag{19.52}$$


The elements of the unknown matrix, $W$, are also hidden in the marginal PDFs of the latent variables,
$z_i$. However, it is not easy to express this dependence explicitly. One possibility is to expand each
one of the marginal densities around the Gaussian PDF, denoted here as $g(z)$, following Edgeworth’s
expansion (Appendix B), and truncate the series to a reasonable approximation. For example, keeping
the first two terms in the Edgeworth expansion we have

$$p_i(z_i) = g(z_i)\left(1 + \frac{1}{3!}\kappa_3(z_i) H_3(z_i) + \frac{1}{4!}\kappa_4(z_i) H_4(z_i)\right), \tag{19.53}$$

where $H_k(z_i)$ is the Hermite polynomial of order $k$ (Appendix B). To obtain an approximate expression
for $I(\mathbf{z})$, in terms of cumulants of $z_i$ and $W$, we can (a) insert in Eq. (19.52) the PDF approximation
in Eq. (19.53), (b) adopt the approximation $\ln(1 + y) \approx y - y^2/2$, and (c) perform the integrations. This
is no doubt a rather painful task! For the case of Eq. (19.53) and constraining $W$ to be orthogonal, the
following is obtained (e.g., [100]):

$$I(\mathbf{z}) \approx C - \sum_{i=1}^{l}\left(\frac{1}{12}\kappa_3^2(z_i) + \frac{1}{48}\kappa_4^2(z_i) + \frac{7}{48}\kappa_3^4(z_i) - \frac{1}{8}\kappa_3^2(z_i)\kappa_4(z_i)\right), \tag{19.54}$$

where $C$ is a quantity independent of $W$. Under the assumption that the PDFs are symmetric (thus,
third-order cumulants are zero), it can be shown that minimizing the approximate expression of the
mutual information in Eq. (19.54) is equivalent to maximizing the sum of the squares of the
fourth-order cumulants. Note that the orthogonal $W$ constraint is not necessary, and if it is not adopted other
approximate expressions for $I(\mathbf{z})$ result (for example, [91]).

Minimization of $I(\mathbf{z})$ in Eq. (19.54) can be carried out by a gradient descent technique (Chapter 5),
where the involved expectations (associated with the cumulants) are replaced by the respective
instantaneous values. Although we will not treat the derivation of algorithmic schemes in detail, in order to
get a flavor of the involved tricks, let us go back to Eq. (19.52), before we apply the approximations.
Because $H(\mathbf{x})$ does not depend on $W$, minimizing $I(\mathbf{z})$ is equivalent to the maximization of

$$J(W) = \ln|\det(W)| + \mathrm{E}\left[\sum_{i=1}^{l} \ln p_i(z_i)\right]. \tag{19.55}$$

Taking the gradient of the cost function with respect to $W$ results in

$$\frac{\partial J(W)}{\partial W} = W^{-T} - \mathrm{E}[\boldsymbol{\phi}(\mathbf{z})\mathbf{x}^T], \tag{19.56}$$

where

$$\boldsymbol{\phi}(\mathbf{z}) := \left[-\frac{p_1'(z_1)}{p_1(z_1)}, \ldots, -\frac{p_l'(z_l)}{p_l(z_l)}\right]^T, \tag{19.57}$$

and

$$p_i'(z_i) := \frac{d p_i(z_i)}{d z_i}, \tag{19.58}$$

and we used the formula

$$\frac{\partial \det(W)}{\partial W} = W^{-T} \det(W).$$

Obviously, the derivatives of the marginal probability densities depend on the type of approximation
adopted in each case. The general gradient ascent scheme at the $i$th iteration step can now be written
as

$$W^{(i)} = W^{(i-1)} + \mu_i\left(\left(W^{(i-1)}\right)^{-T} - \mathrm{E}[\boldsymbol{\phi}(\mathbf{z})\mathbf{x}^T]\right),$$

or

$$W^{(i)} = W^{(i-1)} + \mu_i\left(I - \mathrm{E}[\boldsymbol{\phi}(\mathbf{z})\mathbf{z}^T]\right)\left(W^{(i-1)}\right)^{-T}. \tag{19.59}$$

In practice, the expectation operator is neglected and random variables are replaced by respective
observations, in the spirit of the stochastic approximation rationale (Chapter 5).

The update equation in Eq. (19.59) involves the inversion of the transpose of the current estimate
of $W$. Besides the computational complexity issues, there is no guarantee of the invertibility in the
process of adaptation. The use of the so-called natural gradient [68], instead of the gradient in Eq. (19.56),
results in

$$W^{(i)} = W^{(i-1)} + \mu_i\left(I - \mathrm{E}[\boldsymbol{\phi}(\mathbf{z})\mathbf{z}^T]\right)W^{(i-1)}, \tag{19.60}$$

which does not involve matrix inversion and at the same time improves convergence. A more detailed
treatment of this issue is beyond the scope of this book. Just to give an incentive to the mathematically
inclined reader for delving more deeply into this field, it suffices to say that our familiar gradient, that
is, Eq. (19.56), points to the steepest ascent direction if the space is Euclidean. However, in our case
the parameter space consists of all the nonsingular $l \times l$ matrices, which is a multiplicative group. The
space is Riemannian and it turns out that the natural gradient, pointing to the steepest ascent direction,
results if we multiply the gradient in Eq. (19.56) by $W^T W$, which is the corresponding Riemannian
metric tensor [68].
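The natural gradient recursion in Eq. (19.60) can be sketched as follows. This is a minimal batch illustration, assuming NumPy; the expectation is replaced here by a sample average over all observations rather than by instantaneous values, the step size is kept constant, and the nonlinearity $\phi(z) = \tanh(z)$ is an assumption that corresponds, via Eq. (19.57), to a super-Gaussian marginal proportional to $1/\cosh(z)$. The Laplacian sources and the mixing matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 5000, 2

# Hypothetical independent, unit-variance, super-Gaussian (Laplacian) sources.
S = rng.laplace(0.0, 1 / np.sqrt(2), size=(l, N))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])        # hypothetical mixing matrix
X = A @ S

W = np.eye(l)                     # initial unmixing matrix
mu = 0.1                          # constant step size (assumption)
for _ in range(2000):
    Z = W @ X                     # current estimates of the components
    phi = np.tanh(Z)              # assumed score function, phi = -p'/p
    # Eq. (19.60), with E[phi(z) z^T] replaced by a sample average.
    W = W + mu * (np.eye(l) - phi @ Z.T / N) @ W

P = W @ A                         # ideally close to a scaled permutation matrix
print(np.round(P, 2))
```

Note that no matrix inversion appears in the loop. At convergence the sample average of $\boldsymbol{\phi}(\mathbf{z})\mathbf{z}^T$ is close to the identity, which also fixes the scale of the recovered components.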

Remarks 19.4. 

• From the gradient in Eq. (19.56), it is easy to see that at a stationary point the following is true:

$$\frac{\partial J(W)}{\partial W} W^T = \mathrm{E}[I - \boldsymbol{\phi}(\mathbf{z})\mathbf{z}^T] = O. \tag{19.61}$$

In other words, what we achieve with ICA is a nonlinear generalization of PCA. Recall that for the
latter, the uncorrelatedness condition can be written as

$$\mathrm{E}[I - \mathbf{z}\mathbf{z}^T] = O. \tag{19.62}$$

The presence of the nonlinear function $\boldsymbol{\phi}$ takes us beyond simple uncorrelatedness, and brings
the cumulants into the scene. As a matter of fact, Eq. (19.61) was the one that inspired the early
pioneering work on ICA, as a direct nonlinear generalization of PCA [93,107].

• The origins of ICA are traced back to the seminal paper [93]. For a number of years, it remained
an activity pretty much within the French signal processing and statistics communities. Two papers
were catalytic for its widespread use and popularity, namely, [18] in the mid-1990s and the
development of the FastICA (http://research.ics.aalto.fi/ica/fastica/index.shtml) [99], which allowed
for efficient implementations (see [108] for a related review).

• In machine learning, the use of ICA as a feature generation technique is justified by the following
argument. In [17], it is suggested that the outcome of the early processing performed by the visual
cortical feature detectors might be the result of a redundancy reduction process. Thus, searching for
independent features, conditioned on the input data, is in line with such a claim (see, for example,
[75,114] and the references therein).

• Although we have focused on the noiseless case, extensions of ICA to noisy tasks have also been
proposed (see, e.g., [100]). For an extension of ICA in the complex-valued case, see [2]. Nonlinear
extensions have also been considered, including kernelized ICA versions (for example, [13]).

• In [3], the treatment of ICA also involves random processes and a wider class of signals, including 
Gaussians, can be identified. 

• In [7], the multiset ICA framework of independent vector analysis (IVA) is discussed. It is shown 
that it generalizes the multiset CCA if higher-order, besides second-order, statistics are taken into 
account. 

19.5.5 ALTERNATIVE PATHS TO ICA 

Besides the previously discussed two paths to ICA, a number of alternatives have been suggested,
shedding light on different aspects of the problem. Some notable directions are the following.

• Infomax principle: This method assumes that the latent variables are the outputs of a nonlinear
system (neural network, Chapter 18) of the form

$$z_i = \phi_i(\mathbf{w}_i^T \mathbf{x}) + \eta, \quad i = 1, 2, \ldots, l,$$

where $\phi_i$ are nonlinear functions and $\eta$ is additive Gaussian noise. The weight vectors $\mathbf{w}_i$ are
computed so as to maximize the entropy of the outputs; the reasoning is based on some information
theoretic arguments concerning the information flow in the network [18].

• Maximum likelihood: Starting from Eq. (19.44), the PDF of the observed variables is expressed in
terms of the PDFs of the independent components

$$p(\mathbf{x}) = |\det(W)| \prod_{i=1}^{l} p_i(\mathbf{w}_i^T \mathbf{x}),$$

where we used

$$W := A^{-1}.$$

Assuming that we have $N$ observations, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, and taking the logarithm of the joint
$p(\mathbf{x}_1, \ldots, \mathbf{x}_N)$, one can maximize the log-likelihood with respect to $W$. It is straightforward to
derive the log-likelihood function and to observe that it is very similar to $J(W)$ given in Eq. (19.55).
The $p_i$'s are chosen so as to belong to families of non-Gaussians (for example, [100]). A connection
between the infomax approach and the maximum likelihood one has been established in [37,38].

• Negentropy: According to this method, the starting point is to maximize the non-Gaussianity, which
is now measured in terms of the negentropy, defined as

$$J(\mathbf{z}) := H(\mathbf{z}_{\text{Gauss}}) - H(\mathbf{z}),$$

where $\mathbf{z}_{\text{Gauss}}$ corresponds to Gaussian distributed variables of the same covariance matrix, which
we know corresponds to the maximum entropy, $H$. Thus, maximizing the negentropy, which is a
nonnegative function, is equivalent to making the latent variables as non-Gaussian as possible.
Usually, approximations of the negentropy are employed, which are expressed in terms of higher-order
cumulants, or by matching the nonlinearity to the source distribution [100,141].

• If the unmixing matrix is constrained to be orthogonal, the negentropy and the maximum likelihood
approaches become equivalent [2].

THE COCKTAIL PARTY PROBLEM 

A classical application that demonstrates the power of ICA is the so-called cocktail party problem.
In a party, there are various people speaking; in our case, we are going to consider music as well. Let
us say that there are two people (a female and a male) and there is also monophonic music, making three
sources of sound in total. Then, three microphones (as many as the sources) are placed in different
places in the room and the mixed speech signals are recorded. We denote the inputs to the three
microphones as $x_1(t)$, $x_2(t)$, and $x_3(t)$, respectively. In the simplest of the models, the three recorded signals
can be considered as linear combinations of the individual source signals. Delays are not considered.
The goal is to use ICA and recover the original speech and music from the recorded mixed signals.

To this end and in order to bring the task in the formulation we have previously adopted, we consider
the values of the three signals at different time instants as different observations of the corresponding
random variables, $x_1$, $x_2$, and $x_3$, which are put together to form the random vector $\mathbf{x}$. We further adopt
the very reasonable assumption that the original source signals, denoted as $s_1(t)$, $s_2(t)$, and $s_3(t)$, are
independent and (similarly as before) the values at different time instants correspond to the values of
three latent variables, denoted together as a random vector $\mathbf{s}$.

We are ready now to apply ICA to compute the unmixing matrix $W$, from which we can obtain the
estimates of the ICs corresponding to the observations received by the three microphones,

$$\mathbf{z}(t) = [z_1(t), z_2(t), z_3(t)]^T = W[x_1(t), x_2(t), x_3(t)]^T.$$

Fig. 19.9A shows the three different signals, which are linearly combined (by a set of mixing
coefficients defining a mixing matrix $A$) to form the three “microphone signals.” Fig. 19.9B shows the
resulting signals, which are then used as described before for the ICA analysis. Fig. 19.9C shows the
recovered original signals, as the corresponding ICs. The FastICA algorithm was employed. Fig. 19.10
is the result when PCA is used and the original signals are obtained via the (three) principal
components.

One can observe that ICA manages to separate the signals with very good accuracy, whereas PCA 
fails. The reader can also listen to the signals by downloading the corresponding “.wav” files from the 
site of this book. 
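
The comparison can be reproduced in spirit with a few lines of code. The following is a minimal
sketch (NumPy only) under simplifying assumptions: the three sources are synthetic, independent,
non-Gaussian sequences rather than the speech and music recordings of Fig. 19.9, the mixing matrix
`A_mix` is an arbitrary choice, and the unmixing matrix is obtained with a basic symmetric
fixed-point (FastICA-type) iteration with a tanh nonlinearity. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000

# Hypothetical independent, non-Gaussian sources (rows of S), scaled to unit variance.
S = np.vstack([
    rng.uniform(-1.0, 1.0, N),            # sub-Gaussian source
    rng.laplace(size=N),                  # super-Gaussian source
    np.sign(rng.standard_normal(N)),      # binary (+1/-1) source
])
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

# Arbitrary (assumed) mixing matrix: each "microphone" records a linear mix, no delays.
A_mix = np.array([[1.0, 0.6, 0.4],
                  [0.5, 1.0, 0.3],
                  [0.3, 0.7, 1.0]])
X = A_mix @ S
Xc = X - X.mean(axis=1, keepdims=True)

# PCA: project onto the eigenvectors of the sample covariance matrix.
cov = Xc @ Xc.T / N
eigval, eigvec = np.linalg.eigh(cov)
Z_pca = eigvec.T @ Xc

# ICA: whiten first, then run a symmetric fixed-point iteration with g = tanh.
V_white = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
Xw = V_white @ Xc
U0, _, Vt0 = np.linalg.svd(rng.standard_normal((3, 3)))
W = U0 @ Vt0                               # random orthogonal starting point
for _ in range(200):
    G = np.tanh(W @ Xw)
    W_new = G @ Xw.T / N - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W_new)
    W = U @ Vt                             # symmetric decorrelation of the rows
Z_ica = W @ Xw                             # estimated ICs; overall unmixing is W @ V_white


def separation_score(Z_est, S_true):
    """Worst case, over the sources, of the best absolute correlation with an estimate."""
    k = S_true.shape[0]
    C = np.abs(np.corrcoef(np.vstack([S_true, Z_est]))[:k, k:])
    return C.max(axis=1).min()


ica_score = separation_score(Z_ica, S)
pca_score = separation_score(Z_pca, S)
print(f"ICA score: {ica_score:.3f}, PCA score: {pca_score:.3f}")
```

A score close to one means that every source is matched, up to sign and ordering, by one of the
recovered components. Because PCA relies only on second-order statistics, it cannot resolve the
rotation that remains after decorrelation, which is why its score is expected to be clearly lower.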

Note that the cocktail party problem is representative of a large class of tasks where a number of
recorded signals result as linear combinations of other independent signals; the goal is the recovery
of the latter. A notable application of this kind is found in electroencephalography (EEG). EEG data
consist of electrical potentials recorded at different locations on the scalp (or, more recently, in the
ear [112]), which are generated by the combination of different underlying components of brain and
muscle activity. The task is to use ICA to recover the components, which in turn can unveil useful
information about the brain activity (for example, [158]).

The cocktail party problem is a typical example of a more general class of tasks known as blind
source separation (BSS). The goal in these tasks is to estimate the “causes” (sources, original signals)
based only on information residing in the observations, without any other extra information, and this
is the reason that the word “blind” is used. Viewed in another way, BSS is an example of unsupervised
learning. ICA is among the most widely used techniques for such problems.

FIGURE 19.9

ICA source separation in the cocktail party setting.

FIGURE 19.10

PCA source separation in the cocktail party setting.

FIGURE 19.11

The setup for the ICA simulation example. The two vectors point to the projection directions resulting from the
analysis. The optimal direction for projection, resulting from the ICA analysis, is that of $\mathbf{w}_1$.


Example 19.4. The goal of this example is to demonstrate the power of ICA as a feature generation 
technique, where the most informative of the generated features are to be kept. 

The example is a realization of the case shown in Fig. 19.11. A total of 1024 samples of a
two-dimensional normal distribution were generated.

The mean and covariance matrix of the normal PDF were

$$\boldsymbol{\mu} = [-2.6042,\ 2.5]^T, \qquad \Sigma = \begin{bmatrix} 10.5246 & 9.6313 \\ 9.6313 & 11.3203 \end{bmatrix}.$$

Similarly, 1024 samples from a second normal PDF were generated with the same covariance
matrix and mean $-\boldsymbol{\mu}$. For the ICA, the method based on the second- and fourth-order cumulants,
presented in this section, was used. The resulting transformation matrix $W$ is

$$W = \begin{bmatrix} -0.7088 & 0.7054 \\ 0.7054 & 0.7088 \end{bmatrix} := \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \end{bmatrix}.$$

The vectors $\mathbf{w}_2$ and $\mathbf{w}_1$ point in the principal and minor axis directions, respectively, obtained from the
PCA analysis. According to PCA, the most informative direction is along the principal axis $\mathbf{w}_2$, which
is the one with maximum variance. However, the most interesting direction for projection, according to
the ICA analysis, is that of $\mathbf{w}_1$. Indeed, the kurtosis values of the obtained ICs $z_1$, $z_2$ along these
directions are

$$\kappa_4(z_1) = -1.7, \qquad \kappa_4(z_2) = 0.1,$$

respectively. Thus, projection in the principal (PCA) axis direction results in a variable with a PDF
close to a Gaussian. The projection on the minor axis direction results in a variable with a PDF that
deviates from the Gaussian (it is bimodal) and it is the more interesting one from the classification
point of view. This can be easily verified by looking at the figure; projecting on the direction $\mathbf{w}_2$ leads
to class overlapping.
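
The kurtosis values quoted above can be checked approximately with the following sketch. It is
only an illustration under stated assumptions: the data are regenerated with an arbitrary random
seed, the matrix $W$ is copied from the text instead of being recomputed by ICA, and the kurtosis is
evaluated as the normalized fourth-order cumulant (excess kurtosis) of each projection.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([-2.6042, 2.5])
Sigma = np.array([[10.5246, 9.6313],
                  [9.6313, 11.3203]])
W = np.array([[-0.7088, 0.7054],      # w_1^T (minor axis direction)
              [0.7054, 0.7088]])      # w_2^T (principal axis direction)

# 1024 samples from each of the two Gaussians, with means mu and -mu.
X = np.vstack([rng.multivariate_normal(mu, Sigma, 1024),
               rng.multivariate_normal(-mu, Sigma, 1024)])   # shape (2048, 2)


def excess_kurtosis(z):
    """Fourth-order cumulant of the standardized variable: E[z^4] - 3."""
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0


Z = X @ W.T                            # column 0: projection on w_1, column 1: on w_2
k1 = excess_kurtosis(Z[:, 0])
k2 = excess_kurtosis(Z[:, 1])
print(f"kurtosis along w_1: {k1:.2f}, along w_2: {k2:.2f}")
```

The projection on $\mathbf{w}_1$ is bimodal, since the two class means are well separated along this
direction, and this shows up as a strongly negative kurtosis; the projection on $\mathbf{w}_2$ is nearly
Gaussian, with kurtosis close to zero. The exact values depend on the random draw.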


19.6 DICTIONARY LEARNING: THE k-SVD ALGORITHM 

The concept of overcomplete dictionaries and their importance in modeling real-world signals were
introduced in Chapter 9. We return to this topic, this time in a more general setting. There, the
dictionary was assumed known with preselected atoms. In this section, the blind version of this task is
considered; that is, the atoms of the dictionary are unknown and have to be estimated from the observed
data. Recall that this was also the case with ICA; however, instead of the independence concept used for
ICA, sparsity arguments will be mobilized here. Giving the dictionary the freedom to adapt to the
specific input at hand can lead to enhanced performance compared to dictionaries with
preselected atoms.

Our starting point is that the observed $l$ random variables are expressed in terms of $m > l$ latent
ones according to the linear model

$$\mathbf{x} = A\mathbf{z}, \quad \mathbf{x} \in \mathbb{R}^l,\ \mathbf{z} \in \mathbb{R}^m, \tag{19.63}$$

and $A$ is an unknown $l \times m$ matrix. Usually, $m \gg l$. Even if $A$ were known and fixed, it takes no
special mathematical skill to see that this task does not have a unique solution, and one has to embed
constraints into the problem. To this end, we are going to adopt sparsity promoting constraints, as we
have already discussed in various parts of this book.

Let $\mathbf{x}_n$, $n = 1, 2, \ldots, N$, be the observations, which constitute the only available information.
The task is to obtain the atoms (columns of $A$) of the dictionary as well as the latent variables, which
are assumed to be sparse; that is, we are going to establish a sparse representation of our input
observations (vectors). No doubt, there are different paths to achieve this goal. We are going to focus on
one of the most widely known and used methods, known as k-SVD, proposed in [4].

Let $X := [\mathbf{x}_1, \ldots, \mathbf{x}_N]$, $A := [\mathbf{a}_1, \ldots, \mathbf{a}_m]$, and $Z := [\mathbf{z}_1, \ldots, \mathbf{z}_N]$, where $\mathbf{z}_n$ is the latent vector
corresponding to the input $\mathbf{x}_n$, $n = 1, 2, \ldots, N$. The dictionary learning task is cast as the following
optimization problem:

$$\text{minimize with respect to } A, Z \quad \|X - AZ\|_F^2, \tag{19.64}$$

$$\text{subject to} \quad \|\mathbf{z}_n\|_0 \le T_0, \quad n = 1, 2, \ldots, N, \tag{19.65}$$

where $T_0$ is a threshold value and $\|\cdot\|_0$ denotes the $\ell_0$ norm, as discussed in Chapter 9. This is a
nonconvex optimization task, and it is performed iteratively; each iteration comprises two stages. In the
first one, $A$ is assumed to be fixed and optimization is carried out with respect to $\mathbf{z}_n$, $n = 1, 2, \ldots, N$.
In the second stage, the latent vectors are assumed fixed and optimization is carried out with respect to
the columns of $A$.

In k-SVD, a slightly different rationale is adopted. While optimizing with respect to the columns
of $A$, one at a time, an update of some of the elements of $Z$ is also performed. This is a crucial difference
between k-SVD and the more standard optimization techniques; it appears to lead to improved performance
in practice.

Stage 1: Assume $A$ to be known and fixed to the value obtained from the previous iteration. Then
the associated optimization task becomes

$$\min_{Z}\ \|X - AZ\|_F^2,$$

$$\text{s.t.} \quad \|\mathbf{z}_n\|_0 \le T_0, \quad n = 1, 2, \ldots, N,$$

which, due to the definition of the Frobenius norm, is equivalent to solving $N$ distinct optimization
tasks,

$$\min_{\mathbf{z}_n}\ \|\mathbf{x}_n - A\mathbf{z}_n\|^2, \tag{19.66}$$

$$\text{s.t.} \quad \|\mathbf{z}_n\|_0 \le T_0, \quad n = 1, 2, \ldots, N. \tag{19.67}$$

A similar objective is met if the following optimization tasks are considered instead:

$$\min_{\mathbf{z}_n}\ \|\mathbf{z}_n\|_0,$$

$$\text{s.t.} \quad \|\mathbf{x}_n - A\mathbf{z}_n\|^2 \le \epsilon, \quad n = 1, 2, \ldots, N,$$

where $\epsilon$ is a constant acting as an upper bound of the error.

The task in Eqs. (19.66) and (19.67) can be solved by any one of the minimization solvers
considered in Chapter 10, for example, the OMP. This stage is known as sparse coding.

Stage 2: This stage is known as the codebook update. Having obtained $\mathbf{z}_n$, $n = 1, 2, \ldots, N$ (for
fixed $A$), from stage 1, the goal now is to optimize with respect to the columns of $A$. This is achieved
on a column-by-column basis. Assume that we currently consider the update of $\mathbf{a}_k$; this is carried out
so as to minimize the (squared) Frobenius norm, $\|X - AZ\|_F^2$. To this end, we can write the product
$AZ$ as a sum of rank one matrices, that is,

$$AZ = [\mathbf{a}_1, \ldots, \mathbf{a}_m][\mathbf{z}_1^r, \ldots, \mathbf{z}_m^r]^T = \sum_{i=1}^{m} \mathbf{a}_i \mathbf{z}_i^{rT}, \tag{19.68}$$

where $\mathbf{z}_i^{rT}$, $i = 1, 2, \ldots, m$, are the rows of $Z$. Note that in the above sum, the vectors for indices
$i = 1, 2, \ldots, k-1$ are fixed to their recently updated values during this second stage of the current
iteration step, while the vectors corresponding to $i = k+1, \ldots, m$ are fixed to the values that are
available from the previous iteration step. This strategy allows for the use of the most recently updated
information. We will now minimize with respect to the rank one outer product matrix, $\mathbf{a}_k \mathbf{z}_k^{rT}$. Observe
that this product, besides the $k$th column of $A$, also involves the $k$th row of $Z$; both of them will be
updated. The rank one matrix is estimated so as to minimize

$$\|E_k - \mathbf{a}_k \mathbf{z}_k^{rT}\|_F^2, \tag{19.69}$$

where

$$E_k := X - \sum_{i=1,\, i \ne k}^{m} \mathbf{a}_i \mathbf{z}_i^{rT}.$$

In other words, we seek to find the best, in the Frobenius sense, rank one approximation of $E_k$. Recall
from Chapter 6 (Section 6.4) that the solution is given via the SVD of $E_k$. However, if we do that, there
is no guarantee that whatever sparse structure has been embedded in $\mathbf{z}_k^r$, from the update in stage 1, will
be retained. According to k-SVD, this is bypassed by focusing on the active set, that is, by involving
only its nonzero coefficients. Thus, we first search for the locations of the nonzero coefficients
in $\mathbf{z}_k^r$ and let

$$\omega_k := \left\{ j_k,\ 1 \le j_k \le N : z_k^r(j_k) \ne 0 \right\}.$$

Then, we form the reduced vector $\tilde{\mathbf{z}}_k^r \in \mathbb{R}^{|\omega_k|}$, where $|\omega_k|$ denotes the cardinality of $\omega_k$, which contains
only the nonzero elements of $\mathbf{z}_k^r$. A little thought reveals that when writing $X = AZ$, the column of
current interest, $\mathbf{a}_k$, contributes (as part of the corresponding linear combination) only to the columns
$\mathbf{x}_{j_k}$, $j_k \in \omega_k$, of $X$. We then collect the corresponding columns of $E_k$ to construct a reduced-order
matrix, $\tilde{E}_k$, which comprises the columns that are associated with the locations of the nonzero elements
of $\mathbf{z}_k^r$, and select $\mathbf{a}_k \tilde{\mathbf{z}}_k^{rT}$ so as to minimize

$$\|\tilde{E}_k - \mathbf{a}_k \tilde{\mathbf{z}}_k^{rT}\|_F^2. \tag{19.70}$$

Performing SVD, $\tilde{E}_k = UDV^T$, $\mathbf{a}_k$ is set equal to $\mathbf{u}_1$, corresponding to the largest of the singular values,
and $\tilde{\mathbf{z}}_k^r = D(1,1)\mathbf{v}_1$. Thus, the atoms of the dictionary are obtained in normalized form (recall from the
theory of SVD that $\|\mathbf{u}_1\| = 1$). In the sequel, the updated values obtained for $\tilde{\mathbf{z}}_k^r$ are placed in the
corresponding locations in $\mathbf{z}_k^r$. The latter now has at least as many zeros as it had before, as some of
the elements in $\mathbf{v}_1$ may be zeros. Simple arguments (Problem 19.3) show that at each iteration the error
decreases and the algorithm converges to a local minimum. The success of the algorithm depends on
the ability of the greedy algorithm to provide a sparse solution during the first stage. As we know from
Chapter 10, greedy algorithms work well for sparsity levels, $T_0$, small enough compared to $l$.
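
The two stages described above can be sketched in code as follows. This is a minimal illustration
and not the reference implementation of [4]: the function names `omp` and `ksvd`, the
initialization of the dictionary with randomly chosen data columns, the fixed number of iterations,
and the synthetic test data are all assumptions made for this example. Sparse coding is performed
by a plain OMP, and each atom is updated via the SVD of the reduced error matrix in Eq. (19.70).

```python
import numpy as np


def omp(A, x, T0):
    """Greedy OMP: select at most T0 atoms (columns of A) to approximate x."""
    residual, support = x.copy(), []
    z = np.zeros(A.shape[1])
    coef = np.zeros(0)
    for _ in range(T0):
        k = int(np.argmax(np.abs(A.T @ residual)))
        if k in support:                  # no new atom can reduce the residual
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(A[:, support], x, rcond=None)
        residual = x - A[:, support] @ coef
    z[support] = coef
    return z


def ksvd(X, m, T0, n_iter=5, seed=0):
    """Sketch of k-SVD; returns A, Z and the error after each of the two stages."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    A = X[:, rng.choice(N, m, replace=False)].copy()   # assumed initialization
    A /= np.linalg.norm(A, axis=0)                     # unit-norm atoms
    history = []
    for _ in range(n_iter):
        # Stage 1 (sparse coding): one OMP problem per observation.
        Z = np.column_stack([omp(A, X[:, n], T0) for n in range(N)])
        err_coding = np.linalg.norm(X - A @ Z)
        # Stage 2 (codebook update): one atom at a time.
        for k in range(m):
            omega = np.flatnonzero(Z[k])               # active set of the kth row
            if omega.size == 0:
                continue                               # unused atom is left unchanged
            # Reduced error matrix: columns of E_k indexed by omega.
            E_red = X[:, omega] - A @ Z[:, omega] + np.outer(A[:, k], Z[k, omega])
            U, d, Vt = np.linalg.svd(E_red, full_matrices=False)
            A[:, k] = U[:, 0]                          # a_k = u_1
            Z[k, omega] = d[0] * Vt[0]                 # nonzeros of the row: D(1,1) v_1
        history.append((err_coding, np.linalg.norm(X - A @ Z)))
    return A, Z, history


# Synthetic data generated by a sparse model (illustrative sizes).
rng = np.random.default_rng(0)
l, m, N, T0 = 8, 12, 200, 2
A_true = rng.standard_normal((l, m))
A_true /= np.linalg.norm(A_true, axis=0)
Z_true = np.zeros((m, N))
for n in range(N):
    Z_true[rng.choice(m, T0, replace=False), n] = rng.standard_normal(T0)
X = A_true @ Z_true

A_hat, Z_hat, history = ksvd(X, m, T0)
print([f"{after:.3f}" for _, after in history])
```

Because every rank one update is the best Frobenius approximation of the reduced error matrix, and
the previous pair is a feasible candidate, the codebook update stage cannot increase the error
within an iteration, and it never creates new nonzero entries in $Z$.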

In summary, the k-SVD algorithm comprises the following computation steps.

• Initialize $A^{(0)}$ with columns normalized to unit $\ell_2$ norm.

• Set $i = 1$.

• Stage 1: Solve the optimization task in Eqs. (19.66) and (19.67) to obtain the sparse coding
representation vectors, $\mathbf{z}_n$, $n = 1, 2, \ldots, N$; use any algorithm developed for this task.

• Stage 2: For any column, $k = 1, 2, \ldots, m$, in $A^{(i-1)}$, update it according to the following:

- Identify the locations of the nonzero elements in the $k$th row of the matrix $Z$ computed in
stage 1.

- Select the columns in $E_k$ that correspond to the locations of the nonzero elements of the $k$th
row of $Z$ and form a reduced-order error matrix, $\tilde{E}_k$.

- Perform SVD on $\tilde{E}_k$: $\tilde{E}_k = UDV^T$.

- Update the $k$th column of $A^{(i)}$ to be the left singular vector corresponding to the largest
singular value, $\mathbf{a}_k^{(i)} = \mathbf{u}_1$.

- Update $Z$, by embedding in the nonzero locations of its $k$th row the values $D(1,1)\mathbf{v}_1^T$.

• Stop if a convergence criterion is met.

• If not, $i = i + 1$, and continue.

WHY THE NAME k-SVD?

The SVD part of the name is pretty obvious. However, the reader may wonder about the presence
of “k” in front. As stated in [4], the algorithm can be considered a generalization of the k-means
algorithm, introduced in Chapter 12 (Algorithm 12.1). There, we can consider the mean values, which
represent each cluster, as the code words (atoms) of a dictionary. During the first stage of the k-means
learning, given the representatives of each cluster, a sparse coding scheme is performed; that is, each
input vector is assigned to a single cluster. Thus, we can think of the k-means clustering as a sparse
coding scheme that associates a latent vector with each one of the observations. Note that each of the
latent vectors has only one nonzero element, pointing to the cluster where the respective input vector is
assigned, according to the smallest Euclidean distance from all cluster representatives. This is a major
difference from k-SVD dictionary learning, during which each observation vector can be associated
with more than one atom; hence, the sparsity level of the corresponding latent vector can be larger than
one. Furthermore, based on the assignment of the input vectors to the clusters, in the second stage of the
k-means algorithm, an update of the cluster representatives is performed, and for each representative
only the input vectors assigned to it are used. This is similar in spirit to what happens in the second
stage of k-SVD. The difference is that each input observation may be associated with more than one
atom. As pointed out in [4], if one sets $T_0 = 1$, the k-means algorithm can result from k-SVD.

DICTIONARY LEARNING AND DICTIONARY IDENTIFIABILITY 

Dictionary learning was introduced via the $\ell_0$ norm in (19.64)–(19.65). In a more general setting, the
dictionary learning task is cast as follows:

$$\text{minimize with respect to } A \in \mathcal{A} \text{ and } Z \quad \left\{ \|X - AZ\|_F^2 + \lambda g(Z) \right\}, \tag{19.71}$$

where $g(Z)$ is a sparsity promoting function of the elements of matrix $Z$. For example,

$$g(Z) = \sum_{i=1}^{m} \sum_{j=1}^{N} |Z(i,j)|, \tag{19.72}$$

and $\lambda$ is a regularization parameter. The set $\mathcal{A} \subset \mathbb{R}^{l \times m}$ is a compact constraint set. For example, often
we assume that the columns of $A$ have unit norm. This removes the scaling ambiguity. Indeed, assuming
$A$, $Z$ is a solution, if we do not use any constraint, the pair $A' = cA$, $Z' = \frac{1}{c}Z$ would reproduce the
data equally well, for any nonzero value of $c$. Other constraints can also be used to account for extra
a priori knowledge concerning the available data (see, e.g., [85]).
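
The scaling ambiguity can be seen numerically with the short sketch below, which uses arbitrary
random matrices and an arbitrary scale `c` (both are assumptions of this illustration). The product
and the number of nonzero entries are unaffected by the rescaling, whereas the penalty in
Eq. (19.72) shrinks by the factor $c$; without a constraint on $A$, the regularized cost in
Eq. (19.71) could therefore be reduced simply by inflating the atoms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
Z = rng.standard_normal((6, 10)) * (rng.random((6, 10)) < 0.3)   # sparse codes

c = 10.0                       # arbitrary nonzero scale
A_scaled, Z_scaled = c * A, Z / c

same_product = np.allclose(A @ Z, A_scaled @ Z_scaled)
l0_before, l0_after = np.count_nonzero(Z), np.count_nonzero(Z_scaled)
l1_before, l1_after = np.abs(Z).sum(), np.abs(Z_scaled).sum()

print(same_product, l0_before == l0_after, l1_after / l1_before)
```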

In practice, the solution involves two stages, as with k-SVD. Assuming that $A^{(i)}$ is known at the
iteration step $i$, minimization with respect to the code vectors first takes place, i.e.,

Stage 1:

$$Z^{(i)} = \arg\min_{Z} \left\{ \|X - A^{(i)}Z\|_F^2 + \lambda g(Z) \right\}. \tag{19.73}$$

Then, fixing $Z^{(i)}$, the codebook update results as

Stage 2:

$$A^{(i+1)} = \arg\min_{A \in \mathcal{A}} \|X - AZ^{(i)}\|_F^2. \tag{19.74}$$

In the last equation, the sparsity promoting term is not involved, since it is independent of $A$. Assuming
that $g(Z)$ is a convex function and $\mathcal{A}$ is a convex set, at each stage a convex optimization task is solved.
Such problems are known as biconvex, in the sense that by fixing one of the arguments in the cost function
the optimization problem becomes a convex one with respect to the other. The solution of the optimization
task can be obtained via different paths; for example, via majorize-minimize (MM) techniques or the ADMM
scheme (Section 8.14); see, e.g., [80,140,196] for some application cases under various constraints.
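
As an illustration of the alternating scheme in Eqs. (19.73) and (19.74), the following sketch uses
the $\ell_1$ penalty of Eq. (19.72) and solves each stage approximately with a few proximal gradient
steps. These solver choices are assumptions made for this example (they are neither the MM nor the
ADMM schemes cited above), and the constraint set is taken to be the convex set of matrices whose
columns have norm at most one. The function name and all parameter values are hypothetical.

```python
import numpy as np


def soft_threshold(V, tau):
    """Proximal operator of tau * (sum of absolute values)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)


def cost(X, A, Z, lam):
    """Cost in Eq. (19.71) with g(Z) as in Eq. (19.72)."""
    return np.linalg.norm(X - A @ Z) ** 2 + lam * np.abs(Z).sum()


def dictionary_learning_l1(X, m, lam=0.1, n_outer=30, n_inner=20, seed=0):
    rng = np.random.default_rng(seed)
    l, N = X.shape
    A = rng.standard_normal((l, m))
    A /= np.linalg.norm(A, axis=0)
    Z = np.zeros((m, N))
    costs = [cost(X, A, Z, lam)]
    for _ in range(n_outer):
        # Stage 1: A fixed, proximal gradient (ISTA) steps on Z.
        L = 2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12      # Lipschitz constant of the gradient
        for _ in range(n_inner):
            grad_Z = 2.0 * A.T @ (A @ Z - X)
            Z = soft_threshold(Z - grad_Z / L, lam / L)
        # Stage 2: Z fixed, projected gradient steps on A.
        L = 2.0 * np.linalg.norm(Z, 2) ** 2 + 1e-12
        for _ in range(n_inner):
            grad_A = 2.0 * (A @ Z - X) @ Z.T
            A = A - grad_A / L
            A /= np.maximum(np.linalg.norm(A, axis=0), 1.0)   # project columns onto the unit ball
        costs.append(cost(X, A, Z, lam))
    return A, Z, costs


rng = np.random.default_rng(1)
X = rng.standard_normal((8, 100))
A_hat, Z_hat, costs = dictionary_learning_l1(X, m=12)
print(f"cost: {costs[0]:.2f} -> {costs[-1]:.2f}, nonzeros in Z: {np.count_nonzero(Z_hat)}")
```

With step sizes equal to the inverse of the respective Lipschitz constants, neither stage can
increase the cost, so the recorded sequence `costs` is nonincreasing. As with any scheme for this
nonconvex task, the limit point depends on the initialization.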

Solving the dictionary learning task is a nonconvex optimization problem. An important related
issue is that of the characterization of the associated local minima. This task is another face of what we
call identifiability. In other words, assuming that there exists a dictionary that generates the data, can
it be recovered? Knowing the answer to this question is of major importance in a number of
applications. For example, in a source localization task, the atoms of the dictionary relate to the
directions of arrival of the involved signals. Also, in fMRI (see Section 19.11) the atoms of the dictionary
provide information related to the time responses of the neurons associated with the activated regions
in the brain. In [86], it is shown that, under certain assumptions related to sparsity, the minimized
cost function in the dictionary learning task has a guaranteed local minimum around the dictionary
that generates the data, with high probability. Moreover, a bound on the number of samples that are
sufficient for the existence of such a minimum is also derived.

Remarks 19.5. 

• Alternatives to k-SVD for dictionary learning have also been suggested. For example, in [73]
a dictionary learning technique referred to as the method of optimal directions (MOD) was proposed,
which differs from k-SVD in the dictionary update step. In particular, the full dictionary is updated
via direct minimization of the Frobenius norm. In [126,143] probabilistic arguments are employed,
using a Laplacian prior to enforce sparsity. We know from Chapter 13 (Section 13.5) that in this
case, the involved integrations are not analytically tractable, and the methods differ in the
approximations used to bypass this obstacle. In the former, the maximum value of the
integrand is used and in the latter a Gaussian approximation of the posterior is adopted in order to
handle the integration. In [82], variational bound techniques are mobilized (see Section 13.9).

• The method proposed in [125] bears some similarities to k-SVD, because it also revolves around
SVD, but the dictionary is constrained to be a union of orthonormal bases. This can lead to some
computational advantages; on the other hand, k-SVD puts no constraints on the atoms of the
dictionary, which gives more freedom in modeling the input. Another difference lies in the
column-by-column update introduced in k-SVD.

• A more detailed comparative study of k-SVD with other methods is given in [4]. 

• Dictionary learning is essentially a matrix factorization problem where a certain type of constraint
is imposed on the right matrix factor. This approach can be considered to be just a manifestation
of a wider class of constrained matrix factorization methods that allow several types of constraints
to hold. Such techniques include the regularized PCA, where functional and/or sparsity constraints
are imposed on the left and on the right factors [12,190,200], as well as the structured sparse matrix
factorization in [14].

• Besides the previously reported algorithms, a number of online and distributed dictionary learning
schemes have been proposed. In [127], a distributed version of the k-SVD algorithm is developed.
A cloud-based version of k-SVD is proposed in [149] with an emphasis on big data applications.
A dictionary learning algorithm employing arguments inspired by the EXTRA optimization scheme,
which is discussed in Section 8.15, is described in [182]. In [49], the various agents/nodes learn
different parts of the dictionary. Such an approach is also suited for big data applications. An online
algorithm for dictionary learning has been presented in [139] and an online version for a distributed
setting, with provable convergence to a stationary point, has been proposed in [53]. In [62], the
decentralized case over time-varying digraphs is considered; it involves a general set of constraints,
where a number of matrix factorization schemes result as special cases.

Example 19.5. The goal of this example is to show the performance of the dictionary learning
technique in the context of the image denoising task. In the case study of Section 9.10, image denoising,
based on a predetermined and fixed DCT dictionary, was considered. Here, k-SVD will be employed in
order to learn the dictionary using information from the image itself. The two (256 × 256) images, without
and with noise corresponding to PSNR = 22 dB, are shown in Figs. 19.13A and B, respectively. The noisy
image is divided into overlapping patches of size 12 × 12 (144), resulting in $(256 - 12 + 1)^2 = 60{,}025$
patches in total; these will constitute the training data set used for the learning of the dictionary.
Specifically, the patches are sequentially extracted from the noisy image, vectorized in lexicographic order,
and used as columns, one after the other, to define the (144 × 60,025) matrix $X$. Then, k-SVD is
mobilized to train an overcomplete dictionary of size 144 × 196. The resulting atoms, reshaped in order to
form 12 × 12 pixel patches, are shown in Fig. 19.12. Compare the atoms of this dictionary with the atoms
of the fixed DCT dictionary of Fig. 9.14.

Next, we follow the same procedure as in Section 9.10, by replacing the DCT dictionary with the
one obtained by the k-SVD method. The resulting denoised image is shown in Fig. 19.13C. Note that
although the dictionary was trained based on the noisy data, it led to about 2 dB PSNR improvement
over the fixed-dictionary case. As a matter of fact, because the number of patches is large and each
one of them carries a different noise realization, the noise, during the dictionary learning stage, is
averaged out, leading to nearly noise-free dictionary atoms. More advanced use of dictionary learning
techniques to further improve performance in tasks such as denoising and inpainting can be found in
[71,72,138].
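
The construction of the training matrix $X$ from overlapping patches can be sketched as follows.
A random array stands in for the noisy image, and each patch is vectorized row by row; both
choices are assumptions of this illustration, since the image itself and the exact scanning order
used for the figures are not reproduced here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((256, 256), dtype=np.float32)   # placeholder for the noisy image
p = 12                                             # patch side

# All overlapping p x p patches: array of shape (245, 245, 12, 12).
windows = sliding_window_view(image, (p, p))

# Vectorize each patch (row by row) and stack the vectors as columns of X.
X = windows.reshape(-1, p * p).T

print(X.shape)
```

The first column of `X` is the vectorized top-left patch, and consecutive columns correspond to
patches shifted by one pixel along a row of the image.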


19.7 NONNEGATIVE MATRIX FACTORIZATION

The strong connection between dimensionality reduction and low rank matrix factorization has already
been stressed while discussing PCA. ICA can also be considered as a low rank matrix factorization, if a
smaller number, compared to the $l$ observed random variables, of independent components is retained
(e.g., selecting the $m < l$ least Gaussian ones).

An alternative to the previously discussed low rank matrix factorization schemes was suggested
in [144,145], which guarantees the nonnegativity of the elements of the resulting matrix factors. Such
a constraint is enforced in certain applications because negative elements contradict physical reality.
For example, in image analysis, the intensity values of the pixels cannot be negative. Also, probability
values cannot be negative. The resulting factorization is known as nonnegative matrix factorization
(NMF) and it has been used successfully in a number of applications, including document clustering
[192], molecular pattern discovery [28], image analysis [122], clustering [171], music transcription,
music instrument classification [20,166], and face verification [198].

FIGURE 19.12

Dictionary resulting from k-SVD.

FIGURE 19.13

Image denoising based on dictionary learning. (A) Original image. (B) Image + noise (PSNR = 22).
(C) Denoised image (PSNR = 30.1).

Given an $l \times N$ matrix $X$, the task of NMF consists of finding an approximate factorization of $X$,
that is,

$$X \approx AZ, \tag{19.75}$$

where $A$ and $Z$ are $l \times m$ and $m \times N$ matrices, respectively, $m < \min(N, l)$, and all the matrix elements
are nonnegative, that is, $A(i,k) \ge 0$, $Z(k,j) \ge 0$, $i = 1, 2, \ldots, l$, $k = 1, 2, \ldots, m$, $j = 1, 2, \ldots, N$.
Clearly, if matrices $A$ and $Z$ are of low rank, their product is also a low rank, at most $m$, approximation
of $X$. The significance of the above is that every column vector in $X$ is represented by the expansion

$$\mathbf{x}_i \approx \sum_{k=1}^{m} Z(k,i)\,\mathbf{a}_k, \quad i = 1, 2, \ldots, N,$$

where $\mathbf{a}_k$, $k = 1, 2, \ldots, m$, are the column vectors of $A$ and constitute the basis of the expansion. The
number of vectors in the basis is less than the dimensionality of the vector itself. Hence, NMF can also
be seen as a method for dimensionality reduction.

To get a good approximation in Eq. (19.75) one can adopt different costs. The most common cost
is the (squared) Frobenius norm of the error matrix. In such a setting, the NMF task is cast as follows:

$$\min_{A,Z}\ \|X - AZ\|_F^2 := \sum_{i=1}^{l}\sum_{j=1}^{N} \left( X(i,j) - [AZ](i,j) \right)^2, \tag{19.76}$$

$$\text{s.t.} \quad A(i,k) \ge 0, \quad Z(k,j) \ge 0, \tag{19.77}$$

where $[AZ](i,j)$ is the $(i,j)$ element of matrix $AZ$, and $i$, $j$, $k$ run over all possible values. Besides
the Frobenius norm, other costs have also been suggested (see, e.g., [168]).

Once the problem has been formulated, the major issue lies in the solution of the optimization
task. To this end, a number of algorithms have been proposed, for example, Newton-type or gradient
descent-type algorithms. Such algorithmic issues, as well as a number of related theoretical ones, are
beyond the scope of this book, and the interested reader may consult, for example, [54,67,178]. More
recently, regularized versions, including sparsity promoting regularizers, have been proposed (see, for
example, [55] for a more recent review on the topic).
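
Although the algorithmic details are not treated here, a small sketch may help to make the task in
Eqs. (19.76) and (19.77) concrete. The code below uses the widely known multiplicative update rules
of Lee and Seung for the Frobenius cost; this particular scheme is chosen only for its brevity and
is not one of the algorithms discussed in the text. The function name `nmf`, the random
initialization, the small constant `eps` that guards against division by zero, and the test data are
assumptions of this illustration.

```python
import numpy as np


def nmf(X, m, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative updates for min ||X - AZ||_F^2 subject to A >= 0, Z >= 0."""
    rng = np.random.default_rng(seed)
    l, N = X.shape
    A = rng.random((l, m)) + 0.1          # strictly positive initialization
    Z = rng.random((m, N)) + 0.1
    costs = []
    for _ in range(n_iter):
        Z *= (A.T @ X) / (A.T @ A @ Z + eps)     # elementwise update of Z
        A *= (X @ Z.T) / (A @ Z @ Z.T + eps)     # elementwise update of A
        costs.append(np.linalg.norm(X - A @ Z) ** 2)
    return A, Z, costs


# Nonnegative data with an exact rank-4 nonnegative factorization (illustrative).
rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 30))

A_hat, Z_hat, costs = nmf(X, m=4)
print(f"cost after first / last iteration: {costs[0]:.4f} / {costs[-1]:.6f}")
```

Since the factors are only multiplied by nonnegative quantities, they remain nonnegative throughout
the iterations, so the constraint in Eq. (19.77) is satisfied by construction.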


19.8 LEARNING LOW-DIMENSIONAL MODELS: A PROBABILISTIC PERSPECTIVE

In this section, the emphasis is on looking at the dimensionality reduction task from a Bayesian
perspective. Our focus will be more on presenting the main ideas and less on algorithmic procedures; the
latter depend on the specific model and can be dug out from the palette of algorithms that have already
been presented in Chapters 12 and 13. Our path to low-dimensional modeling traces its origin to the
so-called factor analysis.






19.8.1 FACTOR ANALYSIS 

Factor analysis was originally proposed in the work of Charles Spearman [169]. Charles Spearman
(1863-1945) was an English psychologist who made important contributions to statistics. Spearman
was interested in human intelligence and developed the method in 1904 for analyzing multiple
measures of cognitive performance. He argued that there exists a general intelligence factor (the
so-called g-factor) that can be extracted by applying the factor analysis method to intelligence test data.
However, this notion has been strongly disputed, as intelligence comprises a multiplicity of
components (see, e.g., [84]).

Let $\mathbf{x} \in \mathbb{R}^l$. The factor analysis model assumes that there are $m < l$ underlying (latent) zero mean
variables or factors $\mathbf{z} \in \mathbb{R}^m$ so that

$$x_i - \mu_i = \sum_{j=1}^{m} a_{ij} z_j + \epsilon_i, \quad i = 1, 2, \ldots, l, \tag{19.78}$$

or

$$\mathbf{x} - \boldsymbol{\mu} = A\mathbf{z} + \boldsymbol{\epsilon}, \tag{19.79}$$

where $\boldsymbol{\mu}$ is the mean of $\mathbf{x}$ and $A \in \mathbb{R}^{l \times m}$ is formed by the weights $a_{ij}$, known as factor loadings. The
variables $z_j$, $j = 1, 2, \ldots, m$, are sometimes called common factors, because they contribute to all
observed variables, $x_i$, and $\epsilon_i$ are the unique or specific factors. As we have done so far, and
without loss of generality, we will assume our data are centered, that is, $\boldsymbol{\mu} = \mathbf{0}$. In factor analysis, we
assume $\epsilon_i$ to be of zero mean and mutually uncorrelated, that is,
$\Sigma_\epsilon = E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] := \mathrm{diag}\{\sigma_1^2, \sigma_2^2, \ldots, \sigma_l^2\}$.
We also assume that $\mathbf{z}$ and $\boldsymbol{\epsilon}$ are independent. The $m$ ($< l$) columns of $A$ span a lower-dimensional
subspace, and $\boldsymbol{\epsilon}$ is that part of $\mathbf{x}$ not contained in this subspace. The first question that is now raised is
whether the model in Eq. (19.79) is any different from our familiar regression task. The answer is in the
affirmative. Note that here the matrix $A$ is not known. All that we are given is the set of observations,
$\mathbf{x}_n$, $n = 1, 2, \ldots, N$, and we have to obtain the subspace described by $A$. It is basically the same linear
model that we have considered so far in this chapter, with the difference that now we have introduced
the noise term. Once $A$ is known, $\mathbf{z}_n$ can be obtained for each $\mathbf{x}_n$.

From Eq. (19.79), it is readily seen that

$$\Sigma_x = E[\mathbf{x}\mathbf{x}^T] = A\,E[\mathbf{z}\mathbf{z}^T]\,A^T + \Sigma_\epsilon.$$

We will further assume that $E[\mathbf{z}\mathbf{z}^T] = I$; hence, we can write

$$\Sigma_x = AA^T + \Sigma_\epsilon. \tag{19.80}$$

Hence, $A$ results as a factor of $(\Sigma_x - \Sigma_\epsilon)$. However, such a factorization, if it exists, is not unique. This
can be easily checked if we consider $\bar{A} = AU$, where $U$ is an orthonormal matrix. Then $\bar{A}\bar{A}^T = AA^T$.
This has brought a lot of controversy around the factor analysis method when it comes to interpreting
individual factors (see, for example, [44] for a discussion). To remedy this drawback, a number of
authors have suggested methods and criteria that deal with the rotation (orthogonal or oblique) in order
to gain improved interpretation of the factors [157]. However, from our perspective, where the goal is
to express our problem in a lower-dimensional space, this is not a problem. Any orthonormal matrix
imposes a rotation within the subspace spanned by the columns of $A$, but we do not care about the
exact choice of the coordinates, that is, the common factors.

There are different methods to obtain $A$ (see, e.g., [64]). A popular one is to assume $p(\mathbf{x})$ to be
Gaussian and employ the maximum likelihood method to optimize with respect to the unknown
parameters that define $\Sigma_x$ in Eq. (19.80). Once $A$ becomes available, one way to estimate the factors is
to further assume that these can be expressed as linear combinations of the observations, that is,

$$\mathbf{z} = W\mathbf{x}.$$

Postmultiplying by $\mathbf{x}^T$, taking expectations, and recalling Eq. (19.79) and that $E[\mathbf{z}\mathbf{z}^T] = I$, we get

$$E[\mathbf{z}\mathbf{x}^T] = E[\mathbf{z}\mathbf{z}^T A^T] + E[\mathbf{z}\boldsymbol{\epsilon}^T] = A^T. \tag{19.81}$$

Also,

$$E[\mathbf{z}\mathbf{x}^T] = W E[\mathbf{x}\mathbf{x}^T] = W\Sigma_x. \tag{19.82}$$

Hence,

$$W = A^T\Sigma_x^{-1}.$$

Thus, given a value $\mathbf{x}$, the values of the corresponding latent variables are obtained by

$$\mathbf{z} = A^T\Sigma_x^{-1}\mathbf{x}. \tag{19.83}$$

19.8.2 PROBABILISTIC PCA 

New light on this old problem was shed via the Bayesian rationale in the late 1990s [154,175,176]; the
task was treated for the special case $\Sigma_\epsilon = \sigma^2 I$ and it was named probabilistic PCA (PPCA). The latent
variables, $\mathbf{z}$, are dressed with a Gaussian prior,

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{z}\,|\,\mathbf{0}, I),$$

which is in agreement with the earlier assumption $E[\mathbf{z}\mathbf{z}^T] = I$, and the conditional PDF is chosen as

$$p(\mathbf{x}|\mathbf{z}) = \mathcal{N}\left(\mathbf{x}\,|\,A\mathbf{z}, \sigma^2 I\right),$$

where, for simplicity, we assume $\boldsymbol{\mu} = \mathbf{0}$ (otherwise the mean would be $A\mathbf{z} + \boldsymbol{\mu}$). We are by now pretty
familiar with writing down

$$p(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mathbf{z}\,|\,\boldsymbol{\mu}_{z|x}, \Sigma_{z|x}\right), \tag{19.84}$$

and

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x}\,|\,\mathbf{0}, \Sigma_x), \tag{19.85}$$

where (see Eqs. (12.17), (12.10), and (12.15), Chapter 12)

$$\Sigma_{z|x} = \left(I + \frac{1}{\sigma^2}A^TA\right)^{-1}, \tag{19.86}$$

$$\boldsymbol{\mu}_{z|x} = \frac{1}{\sigma^2}\Sigma_{z|x}A^T\mathbf{x}, \tag{19.87}$$

$$\Sigma_x = \sigma^2 I + AA^T. \tag{19.88}$$

Note that using the Bayesian framework, the computation of the latent variables corresponding to a
given set of observations, $\mathbf{x}$, can naturally be obtained via the posterior $p(\mathbf{z}|\mathbf{x})$ in Eq. (19.84). For
example, one can pick the respective mean value

$$\mathbf{z} = \frac{1}{\sigma^2}\Sigma_{z|x}A^T\mathbf{x}. \tag{19.89}$$

Using the matrix inversion lemma (Problem 19.4), it turns out that Eqs. (19.83) and (19.89) are exactly
the same; however, now, it comes as a natural consequence of our Bayesian assumptions.
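
The equivalence of Eqs. (19.83) and (19.89) is easy to confirm numerically. The following check
uses an arbitrary matrix $A$, noise variance, and observation vector, all of which are assumptions
of the illustration; it is a sanity check and does not replace the proof asked for in Problem 19.4.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m = 5, 2
A = rng.standard_normal((l, m))
sigma2 = 0.3
x = rng.standard_normal(l)

# Eq. (19.83) with Sigma_x = sigma^2 I + A A^T, as in Eq. (19.88).
Sigma_x = sigma2 * np.eye(l) + A @ A.T
z_fa = A.T @ np.linalg.solve(Sigma_x, x)

# Eq. (19.89) with Sigma_{z|x} from Eq. (19.86).
Sigma_zx = np.linalg.inv(np.eye(m) + A.T @ A / sigma2)
z_ppca = Sigma_zx @ A.T @ x / sigma2

print(np.allclose(z_fa, z_ppca))
```

Note that Eq. (19.89) requires the inversion of an $m \times m$ matrix, whereas Eq. (19.83) involves an
$l \times l$ one; for $m \ll l$ the former is the cheaper route.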

One way to compute $A$ is to apply the maximum likelihood method on $\prod_{n=1}^{N} p(\mathbf{x}_n)$ and maximize
with regard to $A$ and $\sigma^2$ (and $\boldsymbol{\mu}$, if $\boldsymbol{\mu} \ne \mathbf{0}$). It turns out that the maximum likelihood solution for $A$ is
given by [175]

$$A_{ML} = U_m\,\mathrm{diag}\{\lambda_1 - \sigma^2, \ldots, \lambda_m - \sigma^2\}^{1/2}R,$$

where $U_m$ is the $l \times m$ matrix with columns the eigenvectors corresponding to the $m$ largest
eigenvalues, $\lambda_i$, $i = 1, 2, \ldots, m$, of the sample covariance matrix of $\mathbf{x}$, and $R$ is an arbitrary orthogonal
matrix ($RR^T = I$). Setting $R = I$, the columns of $A$ are the (scaled) principal directions as computed by the
classical PCA, discussed in Section 19.3. In any case, the columns of $A$ span the principal subspace of
the standard PCA. Note that as $\sigma^2 \to 0$, PPCA tends to PCA (Problem 19.5). Also, it turns out that

$$\sigma_{ML}^2 = \frac{1}{l - m}\sum_{i=m+1}^{l}\lambda_i. \tag{19.90}$$

The previously established connection with PCA does not come as a surprise. It has been well
known for a long time (e.g., [5]) that if in the factor analysis model one assumes $\Sigma_\epsilon = \sigma^2 I$, then
at stationary points of the likelihood function the columns of $A$ are scaled eigenvectors of the
sample covariance matrix. Furthermore, $\sigma^2$ is the average of the discarded eigenvalues, as suggested in
Eq. (19.90).
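
The closed-form solution can be sketched as follows, with $R = I$ and synthetic data whose
dimensions and noise level are arbitrary assumptions. The final check confirms a direct consequence
of the formulas: the model covariance $\sigma_{ML}^2 I + A_{ML}A_{ML}^T$ reproduces the $m$ largest sample
eigenvalues exactly and replaces the remaining ones by their average.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, N = 5, 2, 2000

# Synthetic centered data (illustrative only).
X = rng.standard_normal((l, 3)) @ rng.standard_normal((3, N))
X += 0.3 * rng.standard_normal((l, N))
X -= X.mean(axis=1, keepdims=True)

# Eigendecomposition of the sample covariance, eigenvalues in descending order.
S = X @ X.T / N
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

sigma2_ml = lam[m:].mean()                              # Eq. (19.90)
A_ml = U[:, :m] * np.sqrt(lam[:m] - sigma2_ml)          # A_ML with R = I

# Eigenvalues of the fitted model covariance, in descending order.
Sigma_model = sigma2_ml * np.eye(l) + A_ml @ A_ml.T
model_eigs = np.sort(np.linalg.eigvalsh(Sigma_model))[::-1]

print(np.round(lam, 3), np.round(model_eigs, 3))
```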

Another way to estimate $A$ and $\sigma^2$ is via the EM algorithm [154,175]. This is possible because we
have $p(z|x)$ in an analytic form. Given the set $(x_n, z_n)$, $n = 1, 2, \ldots, N$, of the observed and latent
variables, the complete log-likelihood function is given by

$$\ln p(X, Z; A, \sigma^2) = \sum_{n=1}^{N} \left( \ln p(x_n | z_n; A, \sigma^2) + \ln p(z_n) \right)$$

$$= -\sum_{n=1}^{N} \left( \frac{l}{2}\ln(2\pi) - \frac{l}{2}\ln\beta + \frac{\beta}{2}\|x_n - A z_n\|^2 + \frac{m}{2}\ln(2\pi) + \frac{1}{2} z_n^T z_n \right),$$

which is of the same form as the one given in Eq. (12.45). We have used $\beta = 1/\sigma^2$. Thus, following
similar steps as for Eq. (12.45) and rephrasing Eqs. (12.46)-(12.50) to our current notation, the E-step
becomes

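• E-step:

$$Q(A, \beta; A^{(j)}, \beta^{(j)}) = -\sum_{n=1}^{N} \left( -\frac{l}{2}\ln\beta + \frac{1}{2}\|\mu_{z|x}^{(j)}(n)\|^2 + \frac{1}{2}\,\mathrm{trace}\{\Sigma_{z|x}^{(j)}\} + \frac{\beta}{2}\|x_n - A\mu_{z|x}^{(j)}(n)\|^2 + \frac{\beta}{2}\,\mathrm{trace}\{A\Sigma_{z|x}^{(j)}A^T\} \right) + C,$$

where $C$ is a constant and

$$\mu_{z|x}^{(j)}(n) = \beta^{(j)} \Sigma_{z|x}^{(j)} A^{(j)T} x_n, \qquad \Sigma_{z|x}^{(j)} = \left(I + \beta^{(j)} A^{(j)T} A^{(j)}\right)^{-1}.$$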


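• M-step: Taking the derivatives with regard to $\beta$ and $A$ and equating to zero (Problem 19.6), we
obtain

$$A^{(j+1)} = \left( \sum_{n=1}^{N} x_n \mu_{z|x}^{(j)T}(n) \right) \left( N\Sigma_{z|x}^{(j)} + \sum_{n=1}^{N} \mu_{z|x}^{(j)}(n)\mu_{z|x}^{(j)T}(n) \right)^{-1}, \tag{19.91}$$

$$\beta^{(j+1)} = \frac{Nl}{\sum_{n=1}^{N} \left( \|x_n - A^{(j+1)}\mu_{z|x}^{(j)}(n)\|^2 + \mathrm{trace}\{A^{(j+1)}\Sigma_{z|x}^{(j)}A^{(j+1)T}\} \right)}. \tag{19.92}$$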


Observe that having adopted the EM algorithm, one does not need to compute the eigenvalues/eigenvectors
of $\Sigma_x$. Even retrieving only the $m$ principal components, the lowest cost one has
to pay is $O(ml^2)$ operations. Beyond that, $O(Nl^2)$ are needed to compute $\Sigma_x$. For the EM approach,
the covariance matrix need not be computed and the most demanding part comprises the matrix vector
products, which amount to $O(Nml)$. Hence for $m \ll l$, computational savings are expected compared
to the classical PCA. Keep in mind, though, that the two methods optimize different criteria. PCA
guarantees minimum least-squares error reconstruction, whereas PPCA via the EM optimizes the likelihood.
Thus, for applications where the error reconstruction is important, such as in compression, one has to
be aware of this fact; see [175] for a related discussion.
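The E- and M-steps translate almost line by line into matrix operations. The sketch below assumes zero-mean data stored row-wise, a random initialization, and a fixed number of iterations; these choices, the name `ppca_em`, and the synthetic data are illustrative assumptions, not prescriptions from the text. The sample covariance is never formed inside the EM loop.

```python
import numpy as np

def ppca_em(X, m, n_iter=200, seed=0):
    """EM iterations for PPCA, Eqs. (19.91)-(19.92); X is N x l with zero mean."""
    N, l = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((l, m))         # arbitrary initialization (assumption)
    beta = 1.0
    for _ in range(n_iter):
        # E-step: posterior covariance (common to all n) and posterior means
        Sigma = np.linalg.inv(np.eye(m) + beta * A.T @ A)
        M = beta * X @ A @ Sigma            # row n holds mu_{z|x}(n)^T
        # M-step: Eq. (19.91), then Eq. (19.92) with the updated A
        A = (X.T @ M) @ np.linalg.inv(N * Sigma + M.T @ M)
        resid = X - M @ A.T
        beta = N * l / (np.sum(resid**2) + N * np.trace(A @ Sigma @ A.T))
    return A, beta

rng = np.random.default_rng(0)
A_true = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
N = 2000
X = rng.standard_normal((N, 2)) @ A_true.T + 0.1 * rng.standard_normal((N, 5))
X -= X.mean(axis=0)

A_em, beta_em = ppca_em(X, m=2)
print(1.0 / beta_em)                        # estimated noise variance
```

The columns of the returned matrix need not be orthogonal or ordered; only the subspace they span is comparable with the PCA solution.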

The other alternative route to solve PPCA is by considering $A$ and $\sigma^2$ as random variables with
appropriate priors and applying the variational EM algorithm (see [24]). This has the added advantage
that if one uses as the prior

$$p(A|\alpha) = \prod_{k=1}^{m} \left(\frac{\alpha_k}{2\pi}\right)^{l/2} \exp\left(-\frac{\alpha_k}{2} a_k^T a_k\right),$$

with different precisions, $\alpha_k$, $k = 1, 2, \ldots, m$, per column, then using large enough $m$ one can achieve
pruning of the unnecessary components; this was discussed in Section 13.5. Hence, such an approach
could provide the means of automatic determination of $m$. The interested reader, besides the references
given before, can dig out useful related information from [47].

FIGURE 19.14

Data points are distributed around a straight line (one-dimensional subspace) in $\mathbb{R}^2$. The subspace is fully recovered
by the PPCA, running the EM algorithm.

Example 19.6. Fig. 19.14 shows a set of data which have been generated via a two-dimensional Gaussian,
with zero mean value and covariance matrix equal to

$$\Sigma = \begin{bmatrix} 5.05 & -4.95 \\ -4.95 & 5.05 \end{bmatrix}.$$

The corresponding eigenvalues/eigenvectors are computed as

$$\lambda_1 = 0.05, \quad a_1 = [1, 1]^T,$$

$$\lambda_2 = 5.00, \quad a_2 = [-1, 1]^T.$$

Here the eigenvectors are not normalized, so that $\Sigma = \lambda_1 a_1 a_1^T + \lambda_2 a_2 a_2^T$; with unit-norm
eigenvectors the respective eigenvalues are 0.1 and 10.

Observe that the data are distributed mainly around a straight line. The EM PPCA algorithm was
run on this set of data for $m = 1$. The resulting matrix $A$, which now becomes a vector, is

$$a = [-1.71, 1.71]^T$$

and $\beta = 0.24$. Note that the obtained vector $a$ points in the direction of the line (subspace) around
which the data are distributed.

Remarks 19.6. 

• In PPCA, a special diagonal structure was assumed for $\Sigma_\epsilon$. An EM algorithm for the more general
case has also been derived in the early 1980s [156]. Moreover, if the Gaussian prior imposed on the
latent variables is replaced by another one, different algorithms result. For example, if non-Gaussian
priors are used, then ICA versions are obtained. As a matter of fact, employing different priors,
probabilistic versions of the canonical correlation analysis (CCA) and the partial least-squares (PLS)
methods result; related references have already been given in the respective sections. Sparsity-promoting
priors have also been used, resulting in what is known as sparse factor analysis (for example,
[8,24]). Once the priors have been adopted, one uses standard arguments, more or less, to solve the
task, like those discussed in Chapters 12 and 13.

• Besides real-valued variables, extensions to categorical variables have also been considered (for 
example, [111]). A unifying view of various probabilistic dimensionality reduction techniques is 
provided in [142]. 

19.8.3 MIXTURE OF FACTORS ANALYZERS: A BAYESIAN VIEW 
TO COMPRESSED SENSING 

Let us go back to our original model in Eq. (19.79) and rephrase it into a more “trendy” fashion. Matrix
$A$ had dimensions $l \times m$ with $m < l$, and $z \in \mathbb{R}^m$. Let us now make $m > l$. For example, the columns
of $A$ may comprise vectors of an overcomplete dictionary. Thus, this section can be considered as the
probabilistic counterpart of Section 19.6. The required low dimensionality of the modeling is expressed
by imposing sparsity on $z$; we can rewrite the model in terms of the respective observations as [47]

$$x_n = A(z_n \circ b) + \epsilon_n, \quad n = 1, 2, \ldots, N,$$

where $N$ is the number of our training points and the vector $b \in \mathbb{R}^m$ has elements $b_i \in \{0, 1\}$,
$i = 1, 2, \ldots, m$. The product $z_n \circ b$ is the point-wise vector product, that is,

$$z_n \circ b = [z_n(1)b_1, z_n(2)b_2, \ldots, z_n(m)b_m]^T. \tag{19.93}$$

If $\|b\|_0 \ll l$, then $x_n$ is sparsely represented in terms of the columns of $A$ and its intrinsic dimensionality
is equal to $\|b\|_0$. Adopting the same assumptions as before,

$$p(\epsilon) = \mathcal{N}(\epsilon | 0, \beta^{-1} I_l), \qquad p(z) = \mathcal{N}(z | 0, \alpha^{-1} I_m),$$

where now we have explicitly brought $l$ and $m$ into the notation in order to remind us of the associated
dimensions. Also, for the sake of generality, we have assumed that the elements of $z$ correspond to
precision values different than one. Following our familiar standard arguments (as for Eq. (12.15)), it
is readily shown that the observations $x_n$, $n = 1, 2, \ldots, N$, are drawn from

$$x \sim \mathcal{N}(x | 0, \Sigma_x), \tag{19.94}$$

$$\Sigma_x = \alpha^{-1} A \Lambda A^T + \beta^{-1} I_l, \tag{19.95}$$

where

$$\Lambda = \mathrm{diag}\{b_1, \ldots, b_m\}, \tag{19.96}$$

which guarantees that in Eq. (19.95) only the columns of $A$, which correspond to nonzero values of $b$,
contribute to the formation of $\Sigma_x$. We can rewrite the matrix product in the following form:

$$A \Lambda A^T = \sum_{i=1}^{m} b_i a_i a_i^T,$$

and because only $\|b\|_0 := k \ll l$ nonzero terms contribute to the summation, this corresponds to a
rank $k < l$ matrix, provided that the respective columns of $A$ are linearly independent. Furthermore,
assuming that $\beta^{-1}$ is small, $\Sigma_x$ turns out to have a rank approximately equal to $k$.

FIGURE 19.15

Data points that lie close to a hyperplane can be sufficiently modeled by a Gaussian PDF whose high probability
region corresponds to a sufficiently flat (hyper)ellipsoid.
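The rank argument can be checked numerically. The sketch below uses a random overcomplete matrix and a random binary vector; all sizes, the seed, and the values chosen for $\alpha$ and $\beta$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, k = 8, 20, 3                         # overcomplete (m > l), sparsity k << l
A = rng.standard_normal((l, m))
b = np.zeros(m)
b[rng.choice(m, size=k, replace=False)] = 1.0
Lam = np.diag(b)                           # Eq. (19.96)

M1 = A @ Lam @ A.T
M2 = sum(b[i] * np.outer(A[:, i], A[:, i]) for i in range(m))
rank = np.linalg.matrix_rank(M1)           # equals k for independent columns

alpha, beta = 1.0, 1e4                     # small noise variance 1/beta
Sigma_x = M1 / alpha + np.eye(l) / beta    # Eq. (19.95)
eigvals = np.linalg.eigvalsh(Sigma_x)[::-1]
print(rank, eigvals)
```

The spectrum of $\Sigma_x$ shows $k$ dominant eigenvalues, while the remaining $l - k$ are all equal to $\beta^{-1}$; this is the sense in which the rank is "approximately" $k$.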

Our goal now becomes the learning of the involved parameters; that is, $A$, $\beta$, $\alpha$, and $\Lambda$. This can
be done in a standard Bayesian setting by imposing priors on $\alpha$, $\beta$ (typically gamma PDFs) and for the
columns of $A$,

$$p(a_i) = \mathcal{N}\left(a_i \Big| 0, \frac{1}{l} I_l\right), \quad i = 1, 2, \ldots, m,$$

which guarantees unit expected norm for each column. The prior for the elements of $b$ is chosen to
follow a Bernoulli distribution (see [47] for more details).

Before generalizing the model, let us see the underlying geometric interpretation of the adopted
model. Recall from our statistics basics (see also Section 2.3.2, Chapter 2) that most of the activity
of a set of jointly Gaussian variables takes place within an (hyper)ellipsoid whose principal axes are
determined by the eigenstructure of the covariance matrix. Thus, assuming that the values of $x$ lie close
to a subspace/(hyper)plane, the resulting Gaussian model in Eqs. (19.94) and (19.95) can sufficiently
model it by adjusting the elements of $\Sigma_x$ (after training) so that the corresponding high probability
region forms a sufficiently flat ellipsoid (see Fig. 19.15 for an illustration).

Once we have established the geometric interpretation of our factor model, let us leave our imagination
free to act. Can this viewpoint be extended for modeling data that originate from a union of
subspaces? A reasonable response to this challenge would be to resort to a mixture of factors; one for
each subspace. However, there is more to it than that. It has been shown (for example, [27]) that a compact
manifold can be covered by a finite number of topological disks, whose dimensionality is equal
to the dimensionality of the manifold. Associating topological disks with the principal hyperplanes
that define sufficiently flat hyperellipsoids, one can model the data activity, which takes place along a
manifold, by a sufficient number of factors, one per ellipsoid [47].

FIGURE 19.16

The curve (manifold) is covered by a number of sufficiently flat ellipsoids centered at the respective mean values.

A mixture of factor analyzers (MFA) is defined as

$$p(x) = \sum_{j=1}^{J} P_j \,\mathcal{N}\left(x \,\big|\, \mu_j, \; \alpha_j^{-1} A_j \Lambda_j A_j^T + \beta^{-1} I_l\right), \tag{19.97}$$

where $\sum_{j=1}^{J} P_j = 1$, $\Lambda_j = \mathrm{diag}\{b_{j1}, \ldots, b_{jm}\}$, $b_{ji} \in \{0, 1\}$, $i = 1, 2, \ldots, m$. The expansion in
Eq. (19.97) for fixed $J$ and preselected $\Lambda_j$, for the $j$th factor, has been known for some time; in
this context, learning of the unknown parameters is achieved in the Bayesian framework, by imposing
appropriate priors and mobilizing techniques such as the variational EM (e.g., [79]), EM [175],
and maximum likelihood [180]. In a more recent treatment of the problem, the dimensionality of each
$\Lambda_j$, $j = 1, 2, \ldots, J$, as well as the number of factors, $J$, can be learned by the learning scheme. To this
end, nonparametric priors are mobilized (see, for example, [40,47,94] and Section 13.11). The model
parameters are then computed via Gibbs sampling (Chapter 14) or variational Bayesian techniques.
Note that in general, different factors may turn out to have different dimensionality.

For nonlinear manifold learning, the geometric interpretation of Eq. (19.97) is illustrated in
Fig. 19.16. The number $J$ is the number of flat ellipsoids used to cover the manifold, $\mu_j$ are the
sampled points on the manifold, the columns of $A_j \Lambda_j$ (approximately) span the local $k$-dimensional tangent
subspace, the noise variance $\beta^{-1}$ depends on the manifold curvature, and the weights $P_j$ reflect
the respective density of the points across the manifold. The method has also been used for matrix
completion (see also Section 19.10.1) via a low rank matrix approximation of the involved matrices
[40].

Once the model has been learned, using Bayesian inference on a set of training data $x_n$, $n =
1, 2, \ldots, N$, it can subsequently be used for compressed sensing, that is, to be able to obtain any $x$,
which belongs in the ambient space $\mathbb{R}^l$ but “lives” in the learned $k$-dimensional manifold, modeled by
Eq. (19.97), using $K \ll l$ measurements. To this end, in complete analogy with what has been said in
Chapter 9, one has to determine a sensing matrix, which is denoted here as $\Phi \in \mathbb{R}^{K \times l}$, so as to be able
to recover $x$ from the measured (projection) vector

$$y = \Phi x + \eta,$$

where $\eta$ denotes the vector of the (unobserved) samples of the measurement noise; all that is now
needed is to compute the posterior $p(x|y)$. Assuming the noise samples follow a Gaussian and because
$p(x)$ is a sum of Gaussians, it can be readily seen that the posterior is also a sum of Gaussians
determined by the parameters in Eq. (19.97) and the covariance matrix of the noise vector. Hence,
$x$ can be recovered by a substantially smaller number of measurements, $K$, compared to $l$. In [47],
a theoretical analysis is carried out that relates the dimensionality of the manifold, $k$, the dimensionality
of the ambient space, $l$, and Gaussian/sub-Gaussian types of sensing matrices $\Phi$; this is an analogy
to the RIP so that a stable embedding is guaranteed (Section 9.9).
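The statement that the posterior is again a sum of Gaussians follows from the standard Gaussian conditioning formulas applied per component, with the mixing weights rescaled by how well each component explains $y$. The sketch below is a generic illustration and not the learning scheme of [47]: the mixture parameters are assumed to be already known, and the function name `gmm_posterior`, the two-component model, the sensing matrix, and the noise levels are all hypothetical.

```python
import numpy as np

def gmm_posterior(y, Phi, R, weights, means, covs):
    """Posterior p(x|y) for y = Phi x + eta, eta ~ N(0, R), x ~ sum_j P_j N(mu_j, S_j)."""
    log_w, post_means, post_covs = [], [], []
    for P, mu, S in zip(weights, means, covs):
        C = Phi @ S @ Phi.T + R              # covariance of y under component j
        G = S @ Phi.T @ np.linalg.inv(C)     # gain matrix
        r = y - Phi @ mu
        post_means.append(mu + G @ r)
        post_covs.append(S - G @ Phi @ S)
        _, logdet = np.linalg.slogdet(C)     # common factor (2 pi)^(-K/2) omitted
        log_w.append(np.log(P) - 0.5 * (logdet + r @ np.linalg.solve(C, r)))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    return w / w.sum(), post_means, post_covs

rng = np.random.default_rng(0)
l, k, K, J = 10, 2, 4, 2                     # ambient dim, local dim, measurements, factors
weights = [0.5, 0.5]
means = [3.0 * rng.standard_normal(l) for _ in range(J)]
bases = [rng.standard_normal((l, k)) for _ in range(J)]
covs = [B @ B.T + 1e-6 * np.eye(l) for B in bases]    # flat ellipsoids

Phi = rng.standard_normal((K, l)) / np.sqrt(K)        # Gaussian sensing matrix
R = 1e-6 * np.eye(K)

x = means[0] + bases[0] @ rng.standard_normal(k) + 1e-3 * rng.standard_normal(l)
y = Phi @ x + 1e-3 * rng.standard_normal(K)

w, post_means, _ = gmm_posterior(y, Phi, R, weights, means, covs)
x_hat = sum(wj * mj for wj, mj in zip(w, post_means))  # posterior mean estimate
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(w, rel_err)
```

Here $K = 4$ measurements are used for a vector in $\mathbb{R}^{10}$; recovery works because the signal stays close to one of the $k$-dimensional flats encoded in the prior.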

One has to point out a major difference between the techniques developed in Chapter 9 and the
current section. There, the model that generates the data was assumed to be known; the signal, denoted
there by $s$, was written as

$$s = \Psi\theta,$$

where $\Psi$ was the matrix of the dictionary and $\theta$ the sparse vector. That is, the signal was assumed to
reside in a subspace, which is spanned by some of the columns of $\Psi$; in order to recover the signal
vector, one had to search for it in a union of subspaces. In contrast, in the current section, we had to
“learn” the manifold in which the signal, denoted here by $x$, lies.


19.9 NONLINEAR DIMENSIONALITY REDUCTION 

All the techniques that have been considered so far build around linear models, which relate the ob- 
served and the latent variables. In this section, we turn our attention to their nonlinear relatives. Our 
aim is to discuss the main directions that are currently popular and we will not delve into many de- 
tails. The interested reader can get a deeper understanding and related implementation details from the 
references provided in the text. 

19.9.1 KERNEL PCA

As its name suggests, this is a kernelized version of the classical PCA, and it was first introduced
in [161]. As we have seen in Chapter 11, the idea behind any kernelized version of a linear method
is to map the variables that originally lie in a low-dimensional space, $\mathbb{R}^l$, into a high (possibly
infinite)-dimensional reproducing kernel Hilbert space (RKHS). This is achieved by adopting an implicit
mapping,

$$x \in \mathbb{R}^l \longmapsto \phi(x) \in \mathbb{H}. \tag{19.98}$$

6 Much of this section is based on [174].

Let $x_n$, $n = 1, 2, \ldots, N$, be the available training examples. The sample covariance matrix of the
images, after mapping into $\mathbb{H}$ and assuming centered data, is given by

$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T. \tag{19.99}$$

The goal is to perform the eigendecomposition of $\Sigma$, that is,

$$\Sigma u = \lambda u. \tag{19.100}$$

By the definition of $\Sigma$, it can be shown that $u$ lies in the span$\{\phi(x_1), \phi(x_2), \ldots, \phi(x_N)\}$. Indeed,

$$\lambda u = \left( \frac{1}{N} \sum_{n=1}^{N} \phi(x_n)\phi^T(x_n) \right) u = \frac{1}{N} \sum_{n=1}^{N} \left(\phi^T(x_n) u\right) \phi(x_n),$$

and for $\lambda \neq 0$ we can write

$$u = \sum_{n=1}^{N} a_n \phi(x_n). \tag{19.101}$$


Combining Eqs. (19.100) and (19.101), it turns out (Problem 19.7) that the problem is equivalent to
performing an eigendecomposition of the corresponding kernel matrix (Chapter 11)

$$\mathcal{K} a = N\lambda a, \tag{19.102}$$

where

$$a := [a_1, a_2, \ldots, a_N]^T. \tag{19.103}$$

As we already know (Section 11.5.1), the elements of the kernel matrix are $\mathcal{K}(i, j) = \kappa(x_i, x_j)$ with
$\kappa(\cdot, \cdot)$ being the adopted kernel function. Thus, the $k$th eigenvector of $\Sigma$, corresponding to the $k$th
(nonzero) eigenvalue of $\mathcal{K}$ in Eq. (19.102), is expressed as

$$u_k = \sum_{n=1}^{N} a_{kn} \phi(x_n), \quad k = 1, 2, \ldots, p, \tag{19.104}$$

where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$ denote the respective eigenvalues in descending order, $\lambda_p$ is the smallest
nonzero one, and $a_k^T := [a_{k1}, \ldots, a_{kN}]$ is the $k$th eigenvector of the kernel matrix. The latter is assumed
to be normalized so that $\langle u_k, u_k \rangle = 1$, $k = 1, 2, \ldots, p$, where $\langle \cdot, \cdot \rangle$ is the inner product in the Hilbert
space $\mathbb{H}$. This imposes an equivalent normalization on the respective $a_k$s, resulting from

$$1 = \langle u_k, u_k \rangle = \left\langle \sum_{i=1}^{N} a_{ki}\phi(x_i), \sum_{j=1}^{N} a_{kj}\phi(x_j) \right\rangle = \sum_{i=1}^{N}\sum_{j=1}^{N} a_{ki} a_{kj} \mathcal{K}(i, j)$$

$$= a_k^T \mathcal{K} a_k = N\lambda_k a_k^T a_k, \quad k = 1, 2, \ldots, p. \tag{19.105}$$

7 If the dimension of $\mathbb{H}$ is infinite, the definition of the covariance matrix needs a special interpretation, but we will not bother
with it here.


We are now ready to summarize the basic steps for performing a kernel PCA; that is, to compute
the corresponding latent variables (kernel principal components). Given $x_n \in \mathbb{R}^l$, $n = 1, 2, \ldots, N$, and
a kernel function $\kappa(\cdot, \cdot)$:

• Compute the $N \times N$ kernel matrix, with elements $\mathcal{K}(i, j) = \kappa(x_i, x_j)$.

• Compute the $m$ dominant eigenvalues/eigenvectors $\lambda_k$, $a_k$, $k = 1, 2, \ldots, m$, of $\mathcal{K}$ (Eq. (19.102)).

• Perform the required normalization (Eq. (19.105)).

• Given a feature vector $x \in \mathbb{R}^l$, obtain its low-dimensional representation by computing the $m$
projections onto each one of the dominant eigenvectors,

$$z_k := \langle \phi(x), u_k \rangle = \sum_{n=1}^{N} a_{kn} \kappa(x, x_n), \quad k = 1, 2, \ldots, m. \tag{19.106}$$

The operations given in Eq. (19.106) correspond to a nonlinear mapping in the input space. Note
that, in contrast to the linear PCA, the dominant eigenvectors $u_k$, $k = 1, 2, \ldots, m$, are not computed
explicitly. All we know are the respective (nonlinear) projections $z_k$ along them. However, after all,
this is what we are finally interested in.
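The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel matrix is not centered in the feature space (the text assumes centered data), the function names and the Gaussian kernel width are hypothetical, and the data are synthetic. With a linear kernel and data centered in the input space, the result should coincide with ordinary PCA projections up to a sign, which gives a simple sanity check.

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def rbf_kernel(X, Y, sigma=1.0):
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_pca_fit(X, kernel, m):
    """Return the normalized coefficient vectors a_k (columns) and the kernel matrix."""
    K = kernel(X, X)                                 # N x N kernel matrix
    evals, evecs = np.linalg.eigh(K)                 # ascending; evals equal N*lambda_k
    evals, evecs = evals[::-1][:m], evecs[:, ::-1][:, :m]
    a = evecs / np.sqrt(evals)                       # enforces N*lambda_k*a_k^T a_k = 1
    return a, K

def kernel_pca_project(x, X, a, kernel):
    """Eq. (19.106): z_k = sum_n a_kn * kappa(x, x_n), for one or more points x."""
    return kernel(np.atleast_2d(x), X) @ a

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) * np.array([3.0, 1.0, 0.2])
X -= X.mean(axis=0)

a_lin, K_lin = kernel_pca_fit(X, linear_kernel, m=2)
Z_lin = kernel_pca_project(X, X, a_lin, linear_kernel)

a_rbf, K_rbf = kernel_pca_fit(X, rbf_kernel, m=2)
Z_rbf = kernel_pca_project(X, X, a_rbf, rbf_kernel)  # nonlinear components
print(Z_rbf[:3])
```

Note that the cost is governed by the $N \times N$ kernel matrix, not by the dimensionality of the feature space.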

Remarks 19.7.

• Kernel PCA is equivalent to performing a standard PCA in the RKHS $\mathbb{H}$. It can be shown that all the
properties associated with the dominant eigenvectors, as discussed for PCA, are still valid for the
kernel PCA. That is, (a) the dominant eigenvector directions optimally retain most of the variance;
(b) the MSE in approximating a vector (function) in $\mathbb{H}$ in terms of the $m$ dominant eigenvectors
is minimal with respect to any other $m$ directions; and (c) projections onto the eigenvectors are
uncorrelated [161].

• Recall from Remarks 19.1 that the eigendecomposition of the Gram matrix was required for the 
metric multidimensional scaling (MDS) method. Because the kernel matrix is the Gram matrix in 
the respective RKHS, kernel PCA can be considered as a kernelized version of MDS, where inner 
products in the input space have been replaced by kernel operations in the Gram matrix. 

• Note that the kernel PCA method does not consider an explicit underlying structure of the manifold 
on which the data reside. 

• A variant of kernel PCA, known as the kernel entropy component analysis (ECA), has been devel- 
oped in [103], where the dominant directions are selected so as to maximize the Renyi entropy. 

19.9.2 GRAPH-BASED METHODS

Laplacian Eigenmaps 

The starting point of this method is the assumption that the points in the data set, $\mathcal{X}$, lie on a smooth
manifold $\mathcal{M} \supset \mathcal{X}$, whose intrinsic dimension is equal to $m < l$ and it is embedded in $\mathbb{R}^l$, that is,
$\mathcal{M} \subset \mathbb{R}^l$. The dimension $m$ is given as a parameter by the user. In contrast, this is not required in the kernel
PCA, where $m$ is the number of dominant components, which, in practice, is determined so that the
gap between $\lambda_m$ and $\lambda_{m+1}$ has a “large” value.

The main philosophy behind the method is to compute the low-dimensional representation of the
data so that local neighborhood information in $\mathcal{X} \subset \mathcal{M}$ is optimally preserved. In this way, one
attempts to get a solution that reflects the geometric structure of the manifold. To achieve this, the
following steps are in order.

Step 1: Construct a graph $G = (V, E)$, where $V = \{v_n, n = 1, 2, \ldots, N\}$ is a set of vertices and
$E = \{e_{ij}\}$ is the corresponding set of edges connecting vertices $(v_i, v_j)$, $i, j = 1, 2, \ldots, N$ (see also
Chapter 15). Each node $v_n$ of the graph corresponds to a point $x_n$ in the data set $\mathcal{X}$. We connect $v_i$, $v_j$,
that is, insert the edge $e_{ij}$ between the respective nodes, if points $x_i$, $x_j$ are “close” to each other.
According to the method, there are two ways of quantifying “closeness.” Vertices $v_i$, $v_j$ are connected
with an edge if:

1. $\|x_i - x_j\|^2 < \epsilon$, for some user-defined parameter $\epsilon$, where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^l$, or

2. $x_j$ is among the $k$-nearest neighbors of $x_i$ or $x_i$ is among the $k$-nearest neighbors of $x_j$, where
$k$ is a user-defined parameter and neighbors are chosen according to the Euclidean distance
in $\mathbb{R}^l$. The use of the Euclidean distance is justified by the smoothness of the manifold that allows us to
approximate, locally, manifold geodesics by Euclidean distances in the space where the manifold is
embedded. The latter is a known result from differential geometry.

For those who are unfamiliar with such concepts, think of a sphere embedded in the three- 
dimensional space. If somebody is constrained to live on the surface of the sphere, the shortest 
path to go from one point to another is the geodesic between these two points. Obviously this is 
not a straight line but an arc across the surface of the sphere. However, if these points are close 
enough, their geodesic distance can be approximated by their Euclidean distance, computed in the 
three-dimensional space. 

Step 2: Each edge, $e_{ij}$, is associated with a weight, $W(i, j)$. For nodes that are not connected, the
respective weights are zero. Each weight, $W(i, j)$, is a measure of the “closeness” of the respective
neighbors, $x_i$, $x_j$. A typical choice is

$$W(i, j) = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{\sigma^2}\right), & \text{if } v_i, v_j \text{ correspond to neighbors}, \\ 0, & \text{otherwise}, \end{cases}$$

where $\sigma^2$ is a user-defined parameter. We form the $N \times N$ weight matrix $W$ having as elements the
weights $W(i, j)$. Note that $W$ is symmetric and it is sparse because, in practice, many of its elements
turn out to be zero.

Step 3: Define the diagonal matrix $D$ with elements $D_{ii} = \sum_j W(i, j)$, $i = 1, 2, \ldots, N$, and also
the matrix $L := D - W$. The latter is known as the Laplacian matrix of the graph, $G(V, E)$. Perform
the generalized eigendecomposition

$$Lu = \lambda Du.$$







Let $0 = \lambda_0 \leq \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_m$ be the smallest $m + 1$ eigenvalues. Ignore the $u_0$ eigenvector
corresponding to $\lambda_0 = 0$ and choose the next $m$ eigenvectors $u_1, u_2, \ldots, u_m$. Then map

$$x_n \in \mathbb{R}^l \longmapsto z_n \in \mathbb{R}^m, \quad n = 1, 2, \ldots, N,$$

where

$$z_n^T = [u_{1n}, u_{2n}, \ldots, u_{mn}], \quad n = 1, 2, \ldots, N. \tag{19.107}$$

That is, $z_n$ comprises the $n$th components of the $m$ previous eigenvectors. The computational complexity
of a general eigendecomposition solver amounts to $O(N^3)$ operations. However, for sparse
matrices, such as the Laplacian matrix, $L$, efficient schemes can be employed to reduce complexity to
be subquadratic in $N$, e.g., the Lanczos algorithm [83].
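Steps 1 to 3 can be sketched compactly for small data sets. The code below is an illustration under stated assumptions: it uses the $k$-nearest-neighbor rule, dense matrices, and a dense eigensolver (so it costs $O(N^3)$ rather than exploiting sparsity), and it solves $Lu = \lambda Du$ through the symmetric matrix $D^{-1/2} L D^{-1/2}$. The function name, the parameter values, and the test curve are hypothetical.

```python
import numpy as np

def laplacian_eigenmaps(X, m, k=5, sigma2=1.0):
    """Dense sketch of Laplacian eigenmaps; returns the N x m embedding and eigenvalues."""
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)   # squared distances
    # Step 1: symmetric k-nearest-neighbor graph
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]                   # column 0 is the point itself
    adj = np.zeros((N, N), dtype=bool)
    adj[np.repeat(np.arange(N), k), idx.ravel()] = True
    adj = adj | adj.T
    # Step 2: exponential weights on the edges, zero elsewhere
    W = np.where(adj, np.exp(-d2 / sigma2), 0.0)
    # Step 3: generalized problem L u = lambda D u via D^{-1/2} L D^{-1/2}
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_isqrt = 1.0 / np.sqrt(d)
    L_norm = d_isqrt[:, None] * L * d_isqrt[None, :]
    evals, Y = np.linalg.eigh(L_norm)                          # ascending eigenvalues
    U = d_isqrt[:, None] * Y                                   # u = D^{-1/2} y
    return U[:, 1:m + 1], evals[:m + 1]                        # drop u_0

t = np.linspace(0.0, np.pi, 40)
X = np.column_stack([np.cos(t), np.sin(t)])                    # points on a half circle
Z, evals = laplacian_eigenmaps(X, m=1, k=2)
print(evals)
```

For this one-dimensional curve the single retained coordinate orders the points along the arc, which is the kind of neighborhood-preserving behavior the method aims for.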

The proof concerning the statement of step 3 will be given for the case of $m = 1$. For this case,
the low-dimensional space is the real axis. Our path evolves along the lines adopted in [19]. The goal
is to compute $z_n \in \mathbb{R}$, $n = 1, 2, \ldots, N$, so that connected points (in the graph, i.e., neighbors) stay as
close as possible after the mapping onto the one-dimensional subspace. The criterion used to satisfy
the closeness after the mapping is

$$E_L = \sum_{i=1}^{N}\sum_{j=1}^{N} (z_i - z_j)^2 W(i, j) \tag{19.108}$$

to become minimum. Observe that if $W(i, j)$ has a large value (i.e., $x_i$, $x_j$ are close in $\mathbb{R}^l$), then if the
respective $z_i$, $z_j$ are far apart in $\mathbb{R}$ it incurs a heavy penalty in the cost function. Also, points that are
not neighbors do not affect the minimization as the respective weights are zero. For the more general
case, where $1 < m < l$, the cost function becomes

$$E_L = \sum_{i=1}^{N}\sum_{j=1}^{N} \|z_i - z_j\|^2 W(i, j).$$

Let us now reformulate Eq. (19.108). After some trivial algebra, we obtain

$$E_L = \sum_i z_i^2 \sum_j W(i, j) + \sum_j z_j^2 \sum_i W(i, j) - 2\sum_i\sum_j z_i z_j W(i, j)$$

$$= \sum_i z_i^2 D_{ii} + \sum_j z_j^2 D_{jj} - 2\sum_i\sum_j z_i z_j W(i, j)$$

$$= 2 z^T L z, \tag{19.109}$$

where

$$L := D - W: \quad \text{Laplacian matrix of the graph}, \tag{19.110}$$

and $z^T = [z_1, z_2, \ldots, z_N]$. The Laplacian matrix, $L$, is symmetric and positive semidefinite. The latter
is readily seen from the definition in Eq. (19.109), where $E_L$ is always a nonnegative scalar. Note
that the larger the value of $D_{ii}$ is, the more “important” is the sample $x_i$. This is because it implies
large values for $W(i, j)$, $j = 1, 2, \ldots, N$, and plays a dominant role in the minimization process.
Obviously, the minimum of $E_L$ is achieved by the trivial solution $z_i = 0$, $i = 1, 2, \ldots, N$. To avoid
this, as is common in such cases, we constrain the solution to a prespecified norm. Hence, our problem
now becomes

$$\min_z \; z^T L z,$$

$$\text{s.t.} \quad z^T D z = 1.$$

8 In contrast to the notation used for PCA, the eigenvalues here are marked in ascending order. This is because, in this subsection,
we are interested in determining the smallest values and such a choice is notationally more convenient.

Although we can work directly on the previous task, we will slightly reshape it in order to use tools
that are more familiar to us. Define

$$y = D^{1/2} z, \tag{19.111}$$

and

$$\tilde{L} = D^{-1/2} L D^{-1/2}, \tag{19.112}$$

which is known as the normalized graph Laplacian matrix. It is now readily seen that our optimization
problem becomes

$$\min_y \; y^T \tilde{L} y, \tag{19.113}$$

$$\text{s.t.} \quad y^T y = 1. \tag{19.114}$$

Using Lagrange multipliers and equating the gradient of the Lagrangian to zero, it turns out that the
solution is given by

$$\tilde{L} y = \lambda y. \tag{19.115}$$

In other words, computing the solution becomes equivalent to solving an eigenvalue-eigenvector
problem. Substituting Eq. (19.115) into the cost function in (19.113) and taking into account the
constraint (19.114), it turns out that the value of the cost associated with the optimal $y$ is equal to $\lambda$.
Hence, the solution is the eigenvector corresponding to the minimum eigenvalue. However, the minimum
eigenvalue of $\tilde{L}$ is zero and the corresponding eigenvector corresponds to a trivial solution.
Indeed, observe that

$$\tilde{L} D^{1/2} \mathbf{1} = D^{-1/2} L D^{-1/2} D^{1/2} \mathbf{1} = D^{-1/2} (D - W) \mathbf{1} = 0,$$

where $\mathbf{1}$ is the vector having all its elements equal to 1. In words, $y = D^{1/2}\mathbf{1}$ is an eigenvector
corresponding to the zero eigenvalue and it results in the trivial solution, $z_i = 1$, $i = 1, 2, \ldots, N$. That is,
all the points are mapped onto the same point in the real line. To exclude this undesired solution, recall
that $\tilde{L}$ is a positive semidefinite matrix and, hence, 0 is its smallest eigenvalue. In addition, if the graph
is assumed to be connected, that is, there is at least one path (see Chapter 15) that connects any pair of
vertices, $D^{1/2}\mathbf{1}$ is the only eigenvector associated with the zero eigenvalue, $\lambda_0$ [19]. Also, as $\tilde{L}$ is a symmetric
matrix, we know (Appendix A.2) that its eigenvectors are orthogonal to each other. In the sequel,
we impose an extra constraint and we now require the solution to be orthogonal to $D^{1/2}\mathbf{1}$. Constraining
the solution to be orthogonal to the eigenvector corresponding to the smallest (zero) eigenvalue drives
the solution to the next eigenvector corresponding to the next smallest (nonzero) eigenvalue $\lambda_1$. Note
that the eigendecomposition of $\tilde{L}$ is equivalent to what we called generalized eigendecomposition of $L$
in step 3 before.
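Two facts used in the derivation are easy to confirm numerically: the identity $E_L = 2z^T L z$ of Eq. (19.109), and the claim that $D^{1/2}\mathbf{1}$ is an eigenvector of the normalized Laplacian with zero eigenvalue. The check below uses a small random symmetric weight matrix; the size and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
W = rng.random((N, N))
W = (W + W.T) / 2.0                    # symmetric weights
np.fill_diagonal(W, 0.0)               # no self-loops
D = np.diag(W.sum(axis=1))
L = D - W                              # graph Laplacian, Eq. (19.110)
z = rng.standard_normal(N)

# Double sum of Eq. (19.108) versus the quadratic form of Eq. (19.109)
E_L = sum((z[i] - z[j])**2 * W[i, j] for i in range(N) for j in range(N))
quad = 2.0 * z @ L @ z

d_sqrt = np.sqrt(np.diag(D))
L_norm = L / np.outer(d_sqrt, d_sqrt)  # D^{-1/2} L D^{-1/2}, Eq. (19.112)
trivial = L_norm @ d_sqrt              # should vanish
min_eig = np.linalg.eigvalsh(L_norm).min()
print(E_L, quad, min_eig)
```

The smallest eigenvalue is zero up to rounding, in agreement with the positive semidefiniteness argument.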

For the more general case of $m > 1$, we have to compute the $m$ eigenvectors associated with
$\lambda_1 \leq \ldots \leq \lambda_m$. As a matter of fact, for this case, the constraints prevent us from mapping into a subspace
of dimension less than the desired $m$. For example, we do not want to project in a three-dimensional
space and the points to lie on a two-dimensional plane or on a one-dimensional line. For more details,
the interested reader is referred to the insightful paper [19].

Local Linear Embedding (LLE)

As was the case with the Laplacian eigenmap method, local linear embedding (LLE) assumes that the
data points rest on a smooth enough manifold of dimension $m$, which is embedded in the $\mathbb{R}^l$ space, with
$m < l$ [155]. The smoothness assumption allows us to further assume that, provided there are sufficient
data and the manifold is “well” sampled, nearby points lie on (or close to) a “locally” linear patch
of the manifold (see, also, related comments in Section 19.8.3). The algorithm in its simplest form is
summarized in the following three steps:

Step 1: For each point, $x_n$, $n = 1, 2, \ldots, N$, search for its nearest neighbors.

Step 2: Compute the weights $W(n, j)$, $j = 1, 2, \ldots, N$, that best reconstruct each point, $x_n$, from
its nearest neighbors, so as to minimize the cost

$$\arg\min_W E_W = \sum_{n=1}^{N} \left\| x_n - \sum_{j=1}^{N} W(n, j)\, x_{n_j} \right\|^2, \tag{19.116}$$

where $x_{n_j}$ denotes the $j$th neighbor of the $n$th point. (a) The weights are constrained to be zero for
points which are not neighbors and (b) the rows of the weight matrix add to one, that is,

$$\sum_{j=1}^{N} W(n, j) = 1, \quad n = 1, 2, \ldots, N. \tag{19.117}$$

That is, the sum of the weights, over all neighbors, must be equal to one.

Step 3: Once the weights have been computed from the previous step, use them to obtain the corresponding
points $z_n \in \mathbb{R}^m$, $n = 1, 2, \ldots, N$, so as to minimize the cost with respect to the unknown set
of points $Z = \{z_n, n = 1, 2, \ldots, N\}$,

$$\arg\min_{z_n,\, n=1,\ldots,N} E_Z = \sum_{n=1}^{N} \left\| z_n - \sum_{j=1}^{N} W(n, j)\, z_j \right\|^2. \tag{19.118}$$

The above minimization takes place subject to two constraints, to avoid degenerate solutions: (a) the
outputs are centered, $\sum_n z_n = 0$, and (b) the outputs have unit covariance matrix [159]. Nearest points,
in step 1, are searched in the same way as for the Laplacian eigenmap method. Once again, the use of
the Euclidean distance is justified by the smoothness of the manifold, as long as the search is limited
“locally” among neighboring points. For the second step, the method exploits the local linearity of
a smooth manifold and tries to predict linearly each point by its neighbors using the least-squares
method. Minimizing the cost subject to the constraint given in Eq. (19.117) results in a solution that
satisfies the following three properties:

1. rotation invariance, 

2. scale invariance, 

3. translation invariance. 

The first two can easily be verified by the form of the cost function and the third one is the consequence 
of the imposed constraints. The implication of this is that the computed weights encode information 
about the intrinsic characteristics of each neighborhood and they do not depend on the particular point. 

The resulting weights, $W(i, j)$, reflect the intrinsic properties of the local geometry underlying the
data, and because our goal is to retain the local information after the mapping, these weights are used
to reconstruct each point in the $\mathbb{R}^m$ subspace by its neighbors. As is nicely stated in [159], it is as if we
take a pair of scissors to cut small linear patches of the manifold and place them in the low-dimensional
subspace.

It turns out that solving (19.118) for the unknown points, $z_n$, $n = 1, 2, \ldots, N$, is equivalent to:

• performing an eigendecomposition of the matrix (I — W) T (I — W), 

• discarding the eigenvector that corresponds to the smallest eigenvalue, 

• taking the eigenvectors that correspond to the next (smaller) eigenvalues. These yield the low- 
dimensional latent variable scores, z n , n = 1, 2, ..., N. 

Once again, the involved matrix $W$ is sparse and if this is taken into account the eigenvalue problem
scales relatively well to large data sets with complexity subquadratic in $N$. The complexity for step 2
scales as $O(Nk^3)$ and it is contributed by the solver of the linear set of equations with $k$ unknowns
for each point. The method needs two parameters to be provided by the user, the number of nearest
neighbors, $k$ (or $\epsilon$), and the dimensionality $m$. The interested reader can find more on the LLE method
in [159].
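The three LLE steps can be sketched as follows. Several choices here are assumptions made for the illustration and are not stated in the text: the weights of each point are obtained by solving a $k \times k$ linear system with the local Gram matrix, a small regularization term is added to that matrix (needed when $k > l$, where it is singular), dense matrices are used throughout, and the function name and the test surface are hypothetical.

```python
import numpy as np

def lle(X, m, k=8, reg=1e-3):
    """Dense sketch of LLE; returns the N x m embedding and the weight matrix W."""
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]        # Step 1: k nearest neighbors
    W = np.zeros((N, N))
    for n in range(N):                               # Step 2: reconstruction weights
        Zc = X[nbrs[n]] - X[n]                       # neighbors relative to x_n
        G = Zc @ Zc.T                                # k x k local Gram matrix
        G += reg * np.trace(G) * np.eye(k)           # regularization (assumption)
        w = np.linalg.solve(G, np.ones(k))
        W[n, nbrs[n]] = w / w.sum()                  # rows sum to one, Eq. (19.117)
    I_W = np.eye(N) - W                              # Step 3
    evals, evecs = np.linalg.eigh(I_W.T @ I_W)       # ascending eigenvalues
    Z = evecs[:, 1:m + 1] * np.sqrt(N)               # skip the constant eigenvector
    return Z, W

rng = np.random.default_rng(0)
N = 200
t = rng.uniform(0.0, np.pi, N)
h = rng.uniform(0.0, 2.0, N)
X = np.column_stack([np.cos(t), h, np.sin(t)])       # curved sheet in 3-D
Z, W = lle(X, m=2, k=8)
print(Z.mean(axis=0), (Z.T @ Z / N).round(6))
```

Scaling the retained eigenvectors by $\sqrt{N}$ makes the outputs satisfy the two constraints: zero mean and unit covariance matrix.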

Isometric Mapping (ISOMAP) 

In contrast to the two previous methods, which unravel the geometry of the manifold on a local basis, 
the ISOMAP algorithm adopts the view that only the geodesic distances between all pairs of the data 
points can reflect the true structure of the manifold. Euclidean distances between points in a manifold 
cannot represent it properly, because points that lie far apart, as measured by their geodesic distance, 
may be close when measured in terms of their Euclidean distance (see Fig. 19.17). ISOMAP is ba- 
sically a variant of the multidimensional scaling (MDS) algorithm, in which the Euclidean distances 
are substituted by the respective geodesic distances along the manifold. The essence of the method is 
to estimate geodesic distances between points that lie far apart. To this end, a two-step procedure is 
adopted. 

Step 1: For each point, $x_n$, $n = 1, 2, \ldots, N$, compute the nearest neighbors and construct a graph
$G(V, E)$ whose vertices represent the data points and the edges connect nearest neighbors. (Nearest
neighbors are computed with either of the two alternatives used for the Laplacian eigenmap method.
The parameters $k$ or $\epsilon$ are user-defined parameters.) The edges are assigned weights based on the
respective Euclidean distance (for nearest neighbors, this is a good approximation of the respective
geodesic distance).

FIGURE 19.17

The point denoted by a “star” is deceptively closer to the point denoted by a “dot” than to the point denoted by a
“box” if distance is measured in terms of the Euclidean distance. However, if one is constrained to travel along the
spiral, the geodesic distance is the one that determines closeness and it is the “box” point that is closer to the “star.”

Step 2: Compute the pairwise geodesic distances among all pairs $(i, j)$, $i, j = 1, 2, \ldots, N$, along
shortest paths through the graph. The key assumption is that the geodesic between any two points
on the manifold can be approximated by the shortest path connecting the two points along the graph
$G(V, E)$. To this end, efficient algorithms can be used to achieve it with complexity $O(N^2 \ln N + N^2 k)$
(e.g., Dijkstra's algorithm [59]). This cost can be prohibitive for large values of $N$.
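The two steps can be sketched with the standard library alone. The following is a minimal illustration, assuming a symmetric $k$-nearest-neighbor graph with Euclidean edge weights and a binary-heap version of Dijkstra's algorithm run from every vertex; the function names are hypothetical, and the brute-force neighbor search is chosen for clarity rather than speed.

```python
import heapq
import math

def knn_graph(points, k):
    """Adjacency dicts of a symmetric k-nearest-neighbor graph (Euclidean edge weights)."""
    n = len(points)
    adj = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            adj[i][j] = d
            adj[j][i] = d  # keep the graph undirected
    return adj

def dijkstra(adj, source):
    """Shortest-path lengths from source to every vertex of the weighted graph."""
    dist = [math.inf] * len(adj)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic_distances(points, k):
    """Approximate all pairwise geodesic distances by graph shortest paths."""
    adj = knn_graph(points, k)
    return [dijkstra(adj, s) for s in range(len(points))]
```

For points sampled along a half circle of unit radius, the graph distance between the two end points comes out close to the arc length $\pi$, whereas their Euclidean distance is 2.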

Having estimated the geodesics between all pairs of points, the MDS method is mobilized. Thus,
the problem becomes equivalent to performing the eigendecomposition of the respective Gram matrix
and selecting the $m$ most dominant eigenvectors to represent the low-dimensional space. After the
mapping, Euclidean distances between points in the low-dimensional subspace match the respective
geodesic distances on the manifold in the original high-dimensional space. As is the case in PCA
and MDS, $m$ is estimated by the number of significant eigenvalues. It can be shown that ISOMAP
is guaranteed asymptotically ($N \to \infty$) to recover the true dimensionality of a class of nonlinear
manifolds [66,173].
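The MDS step applied to a matrix of (geodesic) distances can be sketched as follows. This is the classical MDS construction, in which the Gram matrix is obtained by double centering the squared distances; the function name `classical_mds` is hypothetical, and the clipping of slightly negative eigenvalues is a numerical safeguard added here, since graph distances are not exactly Euclidean.

```python
import numpy as np

def classical_mds(D, m):
    """Embed N points in R^m from an N x N matrix D of pairwise distances."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    G = -0.5 * J @ (D ** 2) @ J               # Gram matrix of the centered points
    eigvals, eigvecs = np.linalg.eigh(G)      # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:m]       # indices of the m dominant eigenvalues
    lam = np.clip(eigvals[idx], 0.0, None)    # guard against small negative values
    return eigvecs[:, idx] * np.sqrt(lam)
```

When the input distances are exactly Euclidean, for instance points on a straight line, the embedding reproduces them up to a rigid motion.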

All three graph-based methods share a common step for computing nearest neighbors in a graph.
This is a problem of complexity $O(N^2)$, but more efficient search techniques can be used by employing
a special type of data structures (for example, [23]). A notable difference between the ISOMAP
on the one side and the Laplacian eigenmap and LLE methods on the other is that the latter two
approaches rely on the eigendecomposition of sparse matrices, as opposed to the ISOMAP, which relies on
the eigendecomposition of the dense Gram matrix. This gives a computational advantage to the Laplacian
eigenmap and LLE techniques. Moreover, the calculation of the shortest paths in the ISOMAP
is another computationally demanding task. Finally, it is of interest to note that the three graph-based
techniques perform the task of dimensionality reduction while trying to unravel, in one way or another,
the geometric properties of the manifold on which the data (approximately) lie. In contrast, this is
not the case with the kernel PCA, which shows no interest in any manifold learning. However, as the
world is very small, in [88] it is pointed out that the graph-based techniques can be seen as special
cases of kernel PCA! This becomes possible if data dependent kernels, derived from graphs encoding
neighborhood information, are used in place of predefined kernel functions.

The goal of this section was to present some of the most basic directions that have been suggested
for nonlinear dimensionality reduction. Besides the previous basic schemes, a number of variants
have been proposed in the literature (e.g., [21,70,163]). In [150] and [118] (diffusion maps), the
low-dimensional embedding is achieved so as to preserve certain measures that reflect the connectivity of
the graph $G(V, E)$. In [30,101], the idea of preserving the local information in the manifold has been
carried out to define linear transforms of the form $z = A^T x$, and the optimization is now carried out
with respect to the elements of $A$. The task of incremental manifold learning for dimensionality
reduction was more recently considered in [121]. In [172,184], the maximum variance unfolding method is
introduced. The variance of the outputs is maximized under the constraint that (local) distances and
angles are preserved among neighbors in the graph. Like the ISOMAP, it turns out that the top
eigenvectors of a Gram matrix have to be computed, albeit avoiding the computationally demanding step
of estimating geodesic distances, as is required by the ISOMAP. In [164], a general framework, called
graph embedding, is presented that offers a unified view for understanding and explaining a number
of known (including PCA and nonlinear PCA) dimensionality reduction techniques and it also offers
a platform for developing new ones. For a more detailed and insightful treatment of the topic, the
interested reader is referred to [29]. A review of nonlinear dimensionality reduction techniques can be
found in, for example, [32,123].

Example 19.7. Let a data set consisting of 30 points be in the two-dimensional space. The points result
from sampling the spiral of Archimedes (see Fig. 19.18A), described by

$$x_1 = a\theta\cos\theta, \quad x_2 = a\theta\sin\theta.$$

The points of the data set correspond to the values $\theta = 0.5\pi, 0.7\pi, 0.9\pi, \ldots, 2.05\pi$ ($\theta$ is expressed
in radians), and $a = 0.1$. For illustration purposes and in order to keep track of the “neighboring”
information, we have used a sequence of six symbols, “x,” “+,” “O,” and “o” with black
color, followed by the same sequence of symbols in red color, repeatedly.

To study the performance of PCA for this case, where data lie on a nonlinear manifold, we first 
performed the eigendecomposition of the covariance matrix, estimated from the data set. The resulting 
eigenvalues are 

$$\lambda_2 = 0.089 \quad \text{and} \quad \lambda_1 = 0.049.$$

Observe that the eigenvalues are comparable in size. Thus, if one would trust the “verdict” coming
from PCA, the answer concerning the dimensionality of the data would be that it is equal to 2. Moreover,
after projecting along the direction of the principal component (the straight line in Fig. 19.18B),
corresponding to $\lambda_2$, neighboring information is lost because points from different locations are mixed
together.
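The PCA part of this experiment can be reproduced along the following lines. The sketch assumes 30 equally spaced values of $\theta$ between $0.5\pi$ and $2.05\pi$, which is a guess at the sampling grid, and a covariance estimate normalized by the number of points; both choices, as well as the function names, are illustrative assumptions, so the resulting eigenvalues need not coincide with the ones quoted in the example.

```python
import numpy as np

def spiral_points(n=30, a=0.1, theta_min=0.5 * np.pi, theta_max=2.05 * np.pi):
    """Sample the spiral of Archimedes x1 = a*theta*cos(theta), x2 = a*theta*sin(theta)."""
    theta = np.linspace(theta_min, theta_max, n)  # assumed equally spaced grid
    return np.column_stack((a * theta * np.cos(theta), a * theta * np.sin(theta)))

def pca_eigen(X):
    """Eigenvalues (descending) and eigenvectors of the sample covariance matrix of X."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
    return eigvals[::-1], eigvecs[:, ::-1]

def project_on_first_component(X):
    """One-dimensional PCA scores: projection on the dominant eigenvector."""
    _, eigvecs = pca_eigen(X)
    return (X - X.mean(axis=0)) @ eigvecs[:, 0]
```

Sorting the points by their one-dimensional scores and comparing with their order along the spiral is a quick way to see how the linear projection mixes points from different parts of the curve.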

In the sequel, the Laplacian eigenmap technique for dimensionality reduction is employed, with
$\epsilon = 0.2$ and $\sigma = \sqrt{0.5}$. The obtained results are shown in Fig. 19.18C. Looking from right to left, we see
that the Laplacian method nicely “unfolds” the spiral in a one-dimensional straight line. Furthermore,
neighboring information is retained in this one-dimensional representation of the data. Black and red
areas are succeeding each other in the right order, and also, observing the symbols, one can see that
neighbors are mapped to neighbors.

FIGURE 19.18

(A) A spiral of Archimedes in the two-dimensional space. (B) The previous spiral together with the projections
of the sampled points on the direction of the first principal component, resulting from PCA. It is readily seen that
neighboring information is lost after the projection and points corresponding to different parts of the spiral overlap.
(C) The one-dimensional map of the spiral using the Laplacian method. In this case, the neighboring information is
retained after the nonlinear projection and the spiral nicely unfolds to a one-dimensional line.


Example 19.8. Fig. 19.19 shows samples from a three-dimensional spiral, parameterized as
$x_1 = a\theta\cos\theta$, $x_2 = a\theta\sin\theta$, and sampled at $\theta = 0.5\pi, 0.7\pi, 0.9\pi, \ldots, 2.05\pi$ ($\theta$ is expressed in radians),
$a = 0.1$, and $x_3 = -1, -0.8, -0.6, \ldots, 1$.


FIGURE 19.19


Samples from a three-dimensional spiral. One can think of it as a number of two-dimensional spirals one above the
other. Different symbols have been used in order to track neighboring information.

FIGURE 19.20

Two-dimensional mapping of the spiral of Fig. 19.19 using the Laplacian eigenmap method. The three-dimensional 
structure is unfolded to the two-dimensional space by retaining the neighboring information. 


For illustration purposes and in order to keep track of the “identity” of each point, we have used red
crosses and dots interchangeably, as we move upward in the $x_3$ dimension. Also, the first, the middle,
and the last points for each level of $x_3$ are denoted by distinct black symbols (a black “O” for the first point).
Basically, all points at the same level lie on a two-dimensional spiral.

Fig. 19.20 shows the two-dimensional mapping of the three-dimensional spiral using the Laplacian
method for dimensionality reduction, with parameter values $\epsilon = 0.35$ and $\sigma = \sqrt{0.5}$. Comparing
Figs. 19.19 and 19.20, we see that all points corresponding to the same level $x_3$ are mapped across
the same line, with the first point being mapped to the first one, and so on. That is, as was the case
of Example 19.7, the Laplacian method unfolds the three-dimensional spiral into a two-dimensional
surface, while retaining neighboring information.


19.10 LOW RANK MATRIX FACTORIZATION: A SPARSE MODELING PATH

The low rank matrix factorization task has already been discussed from different perspectives. In this 
section, the task will be considered in a specific context; that of missing entries and/or in the presence of 
outliers. Such a focus is dictated by a number of more recent applications, especially in the framework 
of big data problems. To this end, sparsity promoting arguments will be mobilized to offer a fresh look 
at this old problem. We are not going to delve into many details and our purpose is to highlight the 
main directions and methods which have been considered. 

19.10.1 MATRIX COMPLETION

To recapitulate some of the main findings in Chapters 9 and 10, let us consider a signal vector $s \in \mathbb{R}^l$,
where only $N$ of its components are observed and the rest are unknown. This is equivalent to sensing $s$
via a sensing matrix having its $N$ rows picked uniformly at random from the standard (canonical) basis
$\Phi = I$, where $I$ is the $l \times l$ identity matrix. The question that was posed there was whether it is possible
to recover $s$ exactly based on these $N$ components. From the theory presented in Chapter 9, we know
that one can recover all the components of $s$, provided that $s$ is sparse in some basis or dictionary, $\Psi$,
which exhibits low mutual coherence with $\Phi = I$, and $N$ is large enough, as has been pointed out in
Section 9.9.

Inspired by the theoretical advances in compressed sensing, a question similar in flavor and with
a prominent impact regarding practical applications was posed in [33]. Given an $l_1 \times l_2$ matrix $M$,
assume that only $N \ll l_1 l_2$ among its entries are known. Concerning notation, we refer to a general
matrix $M$, irrespective of how this matrix was formed. For example, it may correspond to an image
array. The question now is whether one is able to recover the exact full matrix. This problem is widely
known as matrix completion [33]. The answer, although it might come as a surprise, is “yes” with high
probability, provided that (a) the matrix is well structured and complies with certain assumptions, (b)
it has a low rank, $r \ll l$, where $l = \min(l_1, l_2)$, and (c) $N$ is large enough. Intuitively, this is plausible
because a low rank matrix is fully described in terms of a number of parameters (degrees of freedom),
which is much smaller than its total number of entries. These parameters are revealed via its SVD


$$M = \sum_{i=1}^{r} \sigma_i u_i v_i^T = U \begin{bmatrix} \sigma_1 & & O \\ & \ddots & \\ O & & \sigma_r \end{bmatrix} V^T, \tag{19.119}$$


where $r$ is the rank of the matrix, $u_i \in \mathbb{R}^{l_1}$ and $v_i \in \mathbb{R}^{l_2}$, $i = 1, 2, \ldots, r$, are the left and right
orthonormal singular vectors, spanning the column and row spaces of $M$, respectively, $\sigma_i$, $i = 1, 2, \ldots, r$, are
the corresponding singular values, and $U = [u_1, u_2, \ldots, u_r]$, $V = [v_1, v_2, \ldots, v_r]$.

Let $\sigma_M$ denote the vector containing all the singular values of $M$, that is, $\sigma_M = [\sigma_1, \sigma_2, \ldots, \sigma_l]^T$.
Then $\operatorname{rank}(M) := \|\sigma_M\|_0$. Counting the parameters associated with the singular values and vectors
in Eq. (19.119), it turns out that the number of degrees of freedom of a rank $r$ matrix is equal to
$d_M = r(l_1 + l_2) - r^2$ (Problem 19.8). When $r$ is small, $d_M$ is much smaller than $l_1 l_2$.
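To get a feeling for the gap between the number of degrees of freedom and the number of entries, the count can be evaluated directly. The helper below is a hypothetical illustration; the comment records one standard way of arriving at the formula, namely counting the singular values together with the free parameters of the two sets of orthonormal vectors.

```python
def degrees_of_freedom(l1, l2, r):
    """Degrees of freedom of an l1 x l2 matrix of rank r.

    Counting argument: r singular values, plus l1*r - r*(r+1)/2 free parameters
    for r orthonormal vectors in R^l1, plus l2*r - r*(r+1)/2 for those in R^l2,
    which adds up to r*(l1 + l2) - r**2.
    """
    return r * (l1 + l2) - r * r

# A 1000 x 1000 matrix of rank 5 has 9975 degrees of freedom,
# against one million entries.
ratio = degrees_of_freedom(1000, 1000, 5) / (1000 * 1000)
```

As a sanity check, setting $r = \min(l_1, l_2)$ gives back $l_1 l_2$, the number of entries of a full rank matrix.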

Let us denote by $\Omega$ the set of $N$ pairs of indices, $(i, j)$, $i = 1, 2, \ldots, l_1$, $j = 1, 2, \ldots, l_2$, of the
locations of the known entries of $M$, which have been sampled uniformly at random. Adopting a
similar rationale to the one running across the backbone of sparsity-aware learning, one would attempt
to recover $M$ based on the following rank minimization problem:


$$\min_{\hat{M} \in \mathbb{R}^{l_1 \times l_2}} \ \|\sigma_{\hat{M}}\|_0, \quad \text{s.t.} \quad \hat{M}(i, j) = M(i, j), \ (i, j) \in \Omega. \tag{19.120}$$


It turns out that assuming there exists a unique low rank matrix having as elements the specific known 
entries, the task in (19.120) leads to the exact solution [33]. However, compared to the case of sparse 
vectors, in the matrix completion problem the uniqueness issue gets much more involved. The following
issues play a crucial part concerning the uniqueness of the task in (19.120).

1. If the number of known entries is lower than the number of degrees of freedom, that is, $N < d_M$,
then there is no way to recover the missing entries whatsoever, because there is an infinite number
of low rank matrices consistent with the $N$ observed entries.

2. Even if $N \ge d_M$, uniqueness is not guaranteed. It is required that the $N$ elements with indices in
$\Omega$ are such that at least one entry per column and one entry per row are observed. Otherwise, even
a rank one matrix $M = \sigma_1 u_1 v_1^T$ cannot be recovered. This becomes clear with a simple example.
Assume that $M$ is a rank one matrix and that no entry in the first column as well as in the last row is
observed. Then, because for this case $M(i, j) = \sigma_1 u_{1i} v_{1j}$, it is clear that no information concerning
the first component of $v_1$ as well as the last component of $u_1$ is available; hence, it is impossible to
recover these singular vector components, regardless of which method is used. As a consequence,
the matrix cannot be completed. On the other hand, if the elements of $\Omega$ are picked at random and
$N$ is large enough, one can only hope that $\Omega$ is such so as to comply with the requirement above,
i.e., at least one entry per row and column to be observed, with high probability. It turns out that
this problem resembles the famous theorem in probability theory known as the coupon collector's
problem. According to this, at least $N = C_0 l \ln l$ entries are needed, where $C_0$ is a constant [134].
This is the information theoretic limit for exact matrix completion [35] of any low rank matrix.

3. Even if points (1) and (2) above are fulfilled, uniqueness is still not guaranteed. In fact, not every
low rank matrix is liable to exact completion, regardless of the number and the positions of the
observed entries. Let us demonstrate this via an example. Let one of the singular vectors be sparse.
Assume, without loss of generality, that the third left singular vector, $u_3$, is sparse with sparsity
level $k = 1$ and also that its nonzero component is the first one, that is, $u_{31} \neq 0$. The rest of $u_i$ and
all $v_i$ are assumed to be dense. Let us return to SVD for a while in Eq. (19.119). Observe that the
matrix $M$ is written as the sum of $r$ $l_1 \times l_2$ matrices $\sigma_i u_i v_i^T$, $i = 1, \ldots, r$. Thus, in this specific case
where $u_3$ is $k = 1$ sparse, the matrix $\sigma_3 u_3 v_3^T$ has zeros everywhere except for its first row. In other
words, the information that $\sigma_3 u_3 v_3^T$ brings to the formation of $M$ is concentrated in its first row
only. This argument can also be viewed from another perspective; the entries of $M$ obtained from
any row except the first one do not provide any useful information with respect to the values of
the free parameters $\sigma_3$, $u_3$, $v_3$. As a result, in this case, unless one incorporates extra information
about the sparse nature of the singular vector, the entries from the first row that are missed are not
recoverable, because the number of parameters concerning this row is larger than the number of the
available data.
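The coupon collector effect behind the second point can be explored with a small Monte Carlo experiment. The sketch below uses only the standard library; the function names, the square $l \times l$ setting, the number of trials, and the seed are all illustrative assumptions. It estimates how often $N$ positions drawn uniformly at random (without repetition) hit every row and every column at least once.

```python
import math
import random

def covers_all_rows_and_columns(l, N, rng):
    """Draw N distinct positions of an l x l matrix; check that no row or column is empty."""
    cells = rng.sample(range(l * l), N)
    rows = {c // l for c in cells}
    cols = {c % l for c in cells}
    return len(rows) == l and len(cols) == l

def coverage_probability(l, N, trials=200, seed=0):
    """Fraction of random draws in which every row and column is observed."""
    rng = random.Random(seed)
    hits = sum(covers_all_rows_and_columns(l, N, rng) for _ in range(trials))
    return hits / trials

# Example: compare N = l with N = 4 * l * ln(l) for l = 50
p_few = coverage_probability(50, 50)
p_many = coverage_probability(50, int(4 * 50 * math.log(50)))
```

With $N$ of the order of $l$ the requirement is practically never met, while a modest multiple of $l \ln l$ makes it hold in almost every draw.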

Intuitively, when a matrix has dense singular vectors it is better rendered for exact completion as
each one among the observed entries carries information associated with all the $d_M$ parameters that
fully describe it. To this end, a number of conditions which evaluate the suitability of the singular
vectors have been established. The simplest one is the following [33]:




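$$\|u_i\|_\infty \le \sqrt{\frac{\mu_B}{l_1}}, \quad \|v_i\|_\infty \le \sqrt{\frac{\mu_B}{l_2}}, \quad i = 1, 2, \ldots, r, \tag{19.121}$$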


where $\mu_B$ is a bound parameter. In fact, $\mu_B$ is a measure of the coherence of matrix $U$ (and similarly
of $V$) (vis-à-vis the standard basis), defined as follows:

$$\mu(U) := \frac{l_1}{r} \max_{1 \le i \le l_1} \|P_U e_i\|^2, \tag{19.122}$$

where $P_U$ defines the orthogonal projection to subspace $U$ and $e_i$ is the $i$th vector of the canonical
basis. Note that when $U$ results from SVD, then $\|P_U e_i\| = \|U^T e_i\|$. In essence, coherence
is an index quantifying the extent to which the singular vectors are correlated with the standard
basis $e_i$, $i = 1, 2, \ldots, l$. The smaller $\mu_B$, the less “spiky” the singular vectors are likely to be, and
the corresponding matrix is better suited for exact completion. Indeed, assuming for simplicity a
square matrix $M$, that is, $l_1 = l_2 = l$, if any one among the singular vectors is sparse having a single
nonzero component only, then, taking into account that $u_i^T u_i = v_i^T v_i = 1$, this value will have
magnitude equal to one and the bound parameter will take its largest value possible, that is, $\mu_B = l$.
On the other hand, the smallest value that $\mu_B$ can get is 1, something that occurs when the components
of all the singular vectors assume the same value (in magnitude). Note that in this case, due
to the normalization, this common component value has magnitude $1/\sqrt{l}$. Tighter bounds to a matrix
coherence result from the more elaborate incoherence property [33,151] and the strong incoherence
property [35]. In all cases, the larger the bound parameter is, the larger the number of known entries
becomes, which is required to guarantee uniqueness.

9 This is a quantity different from the mutual coherence already discussed in Section 9.6.1.
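The coherence in (19.122) is straightforward to evaluate for a matrix with orthonormal columns, because the norm of the projection of $e_i$ equals the norm of the $i$th row of $U$. The following sketch (the function name `coherence` is hypothetical) illustrates the two extreme cases: columns taken from the canonical basis, and columns whose components all share the same magnitude.

```python
import numpy as np

def coherence(U):
    """mu(U) = (l / r) * max_i ||P_U e_i||^2 for an l x r matrix U with orthonormal columns."""
    U = np.asarray(U, dtype=float)
    l, r = U.shape
    row_norms_sq = np.sum(U ** 2, axis=1)  # ||U^T e_i||^2 is the squared norm of row i
    return l / r * row_norms_sq.max()

# "Spiky" case: two columns of the 8 x 8 identity matrix, coherence l / r = 4
mu_spiky = coherence(np.eye(8)[:, :2])

# Dense case: components of equal magnitude, coherence 1
U_dense = np.array([[1, 1], [1, -1], [1, 1], [1, -1]]) / 2.0
mu_dense = coherence(U_dense)
```

The two values, $l/r$ and 1, are the largest and smallest values this quantity can take.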

In Section 19.10.3, the aspects of uniqueness will be discussed in the context of a real-life application.

The task formulated in (19.120) is of limited practical interest because it is an NP-hard task. Thus,
borrowing the arguments used in Chapter 9, the $\ell_0$ (pseudo)norm is replaced by a convexly relaxed
counterpart of it, that is,


$$\min_{\hat{M} \in \mathbb{R}^{l_1 \times l_2}} \ \|\sigma_{\hat{M}}\|_1, \quad \text{s.t.} \quad \hat{M}(i, j) = M(i, j), \ (i, j) \in \Omega, \tag{19.123}$$


where $\|\sigma_{\hat{M}}\|_1$, that is, the sum of the singular values, is referred to as the nuclear norm of the
matrix $\hat{M}$, often denoted as $\|\hat{M}\|_*$. The nuclear norm minimization was proposed in [74] as a convex
approximation of rank minimization, which can be cast as a semidefinite programming task.


Theorem 19.1. Let $M$ be an $l_1 \times l_2$ matrix of rank $r$, which is a constant much smaller than
$l = \min(l_1, l_2)$, obeying (19.121). Suppose that we observe $N$ entries of $M$ with locations sampled
uniformly at random. Then there is a positive constant $C$ such that if

$$N \ge C \mu_B^4\, l \ln^2 l, \tag{19.124}$$

then $M$ is the unique solution to the task in (19.123) with probability at least $1 - l^{-3}$.


There might be an ambiguity on how small the rank should be in order for the corresponding matrix
to be characterized as “low rank.” More rigorously, a matrix is said to be of low rank if $r = O(1)$, which
means that $r$ is a constant with no dependence (not even logarithmic) on $l$. Matrix completion is also
possible for more general rank cases where, instead of the mild coherence property of (19.121), the
incoherence and the strong incoherence properties [33,35,87,151] are mobilized in order to get similar
theoretical guarantees. The detailed exposition of these alternatives is beyond the scope of this book. In
fact, Theorem 19.1 embodies the essence of the matrix completion task: with high probability, nuclear
norm minimization recovers all the entries of a low rank matrix $M$ with no error. More importantly, the
number of entries, $N$, that the convexly relaxed problem needs is only by a logarithmic factor larger
than the information theoretic limit, which, as was mentioned before, equates to $C_0 l \ln l$. Moreover,
similar to compressed sensing, robust matrix completion in the presence of noise is also possible as long
as the request $\hat{M}(i, j) = M(i, j)$ in Eqs. (19.120) and (19.123) is replaced by $\|\hat{M}(i, j) - M(i, j)\|_2 \le \epsilon$
[34]. Furthermore, the notion of matrix completion has also been extended to tensors (for example,
[77,165]).

19.10.2 ROBUST PCA 

The developments on matrix completion theory led, more recently, to the formulation and solution
of another problem of high significance. To this end, the notation $\|M\|_1$, that is, the $\ell_1$ norm of a
matrix, is introduced and defined as the sum of the absolute values of its entries, that is,
$\|M\|_1 = \sum_{i=1}^{l_1} \sum_{j=1}^{l_2} |M(i, j)|$. In other words, it acts on the matrix as if this were a long vector.

Assume now that $M$ is expressed as the sum of a low rank matrix $L$ and a sparse matrix $S$, that
is, $M = L + S$. Consider the following convex minimization task [36,43,187,193], which is
usually referred to as principal component pursuit (PCP):

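$$\min_{\hat{L}, \hat{S}} \ \|\sigma_{\hat{L}}\|_1 + \lambda \|\hat{S}\|_1, \tag{19.125}$$

$$\text{s.t.} \quad \hat{L} + \hat{S} = M, \tag{19.126}$$

where $\hat{L}$ and $\hat{S}$ are both $l_1 \times l_2$ matrices. It can be shown that solving the task in (19.125) and (19.126)
recovers both $L$ and $S$ according to the following theorem [36].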

Theorem 19.2. The PCP recovers both $L$ and $S$ with probability at least $1 - c\, l_1^{-10}$, where $c$ is a
constant, provided that:

1. The support set of S is uniformly distributed among all sets of cardinality N. 

2. The number $k$ of nonzero entries of $S$ is relatively small, that is, $k \le \rho\, l_1 l_2$, where $\rho$ is a sufficiently
small positive constant.

3. L obeys the incoherence property. 

4. The regularization parameter $\lambda$ is constant with value $\lambda = \frac{1}{\sqrt{l_2}}$.

5. We have rank($L$) < $C$, with $C$ being a constant.

In other words, based on all the entries of a matrix $M$, which is known to be the sum of two
unknown matrices $L$ and $S$, with the first one being of low rank and the second being sparse, PCP
recovers exactly, with probability almost 1, both $L$ and $S$, irrespective of how large the magnitudes of
the entries of $S$ are, provided that both $r$ and $k$ are sufficiently small.
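For readers who wish to experiment, the PCP task in (19.125) and (19.126) can be approached with a simple augmented Lagrangian (ADMM-type) iteration that alternates singular value thresholding for the low rank part with entrywise soft thresholding for the sparse part. The sketch below is one common way of solving the problem, not the only one; the function names, the default $\lambda = 1/\sqrt{\max(l_1, l_2)}$, the penalty parameter $\mu = l_1 l_2 / (4\|M\|_1)$, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def svt(X, tau):
    """Singular value soft thresholding: shrink every singular value of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(X, tau):
    """Entrywise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def pcp(M, lam=None, mu=None, n_iter=1000, tol=1e-7):
    """Split M into a low rank part L and a sparse part S (illustrative ADMM sketch)."""
    M = np.asarray(M, dtype=float)
    l1, l2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(l1, l2))          # assumed default weight
    if mu is None:
        mu = l1 * l2 / (4.0 * np.abs(M).sum())    # assumed penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                          # Lagrange multipliers
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)         # nuclear norm step
        S = shrink(M - L + Y / mu, lam / mu)      # l1 norm step
        R = M - L - S                             # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S
```

A quick experiment is to generate a random rank-two matrix, add a few large-magnitude outliers at random positions, and compare the returned `L` with the uncorrupted low rank matrix.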

The applicability of the previous task is very broad. For example, PCP can be employed in order to
find a low rank approximation of $M$. In contrast to the standard PCA (SVD) approach, PCP is robust
and insensitive in the presence of outliers, as these are naturally modeled via the presence of $S$. Note
that outliers are sparse by their nature. For this reason, the above task is widely known as robust PCA
via nuclear norm minimization. (More classical PCA techniques are known to be sensitive to outliers
and a number of alternative approaches have in the past been proposed toward its robustification (for
example, [98,109]).)

When PCP serves as a robust PCA approach, the matrix of interest is $L$ and $S$ accounts for the
outliers. However, PCP estimates both $L$ and $S$. As will be discussed soon, another class of applications
are well accommodated when the focus of interest is turned to the sparse matrix $S$ itself.

Remarks 19.8. 

• Just as $\ell_1$ minimization is the tightest convex relaxation of the combinatorial $\ell_0$ minimization
problem in sparse modeling, the nuclear norm minimization is the tightest convex relaxation of the
NP-hard rank minimization task. Besides the nuclear norm, other heuristics have also been
proposed, such as the log-determinant heuristic [74] and the max norm [76].

• The nuclear norm as a rank minimization approach is the generalization of the trace-related cost, 
which is often used in the control community for the rank minimization of positive semidefinite 
matrices [133]. Indeed, when the matrix is symmetric and positive semidefinite, the nuclear norm of 
M is the sum of the eigenvalues and, thus, it is equal to the trace of M. Such problems arise when, 
for example, the rank minimization task refers to covariance matrices and positive semidefinite 
Toeplitz or Hankel matrices (see, e.g., [74]). 

• Both matrix completion (19.123) and PCP (19.125) and (19.126) can be formulated as semidefinite programs and
are solved based on interior point methods. However, whenever the size of a matrix becomes large
(e.g., $100 \times 100$), these methods are deemed to fail in practice due to excessive computational load
and memory requirements. As a result, there is an increasing interest, which has propelled intensive
research efforts, for the development of efficient methods to solve both optimization tasks, or
related approximations, which scale well with large matrices. Many of these methods revolve around
the philosophy of the iterative soft and hard thresholding techniques, as discussed in Chapter 9.
However, in the current low rank approximation setting, it is the singular values of the estimated
matrix that are thresholded. As a result, in each iteration, the estimated matrix, after thresholding its
singular values, tends to be of lower rank. Either the thresholding of the singular values is imposed,
such as in the case of the singular value thresholding (SVT) algorithm [31], or it results as a solution
of regularized versions of (19.123) and (19.126) (see, e.g., [48,177]). Moreover, algorithms inspired
by greedy methods such as CoSaMP have also been proposed (e.g., [124,183]).

• Improved versions of PCP that allow for exact recovery even if some of the constraints of
Theorem 19.2 are relaxed have also been developed (see, e.g., [78]). Fusions of PCP with matrix
completion and compressed sensing are possible, in the sense that only a subset of the entries of $M$
is available and/or linear measurements of the matrix in a compressed sensing fashion can be used
instead of matrix entries (for example, [183,188]). Moreover, stable versions of PCP dealing with
noise have also been investigated (for example, [199]).
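To make the idea of thresholding singular values concrete, the following sketch completes a matrix by repeatedly filling the missing positions with the current estimate and then soft thresholding the singular values. It is a simplified iteration in the spirit of the regularized approaches mentioned above, not a reproduction of any specific published algorithm; the function names, the threshold `tau`, and the fixed number of iterations are assumptions made for illustration.

```python
import numpy as np

def svt(X, tau):
    """Singular value soft thresholding: shrink every singular value of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_matrix(M_obs, mask, tau=0.1, n_iter=500):
    """Estimate a low rank matrix from the entries of M_obs where mask is True.

    Entries of M_obs outside the mask are ignored.
    """
    M_obs = np.asarray(M_obs, dtype=float)
    X = np.where(mask, M_obs, 0.0)             # start from the zero-filled matrix
    for _ in range(n_iter):
        filled = np.where(mask, M_obs, X)      # observed entries kept, others guessed
        X = svt(filled, tau)                   # thresholding pushes toward low rank
    return X
```

For a random rank-one matrix with roughly half of its entries observed, the estimate is typically far closer to the true matrix than the zero-filled starting point; the small residual bias is due to the shrinkage by `tau`.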

19.10.3 APPLICATIONS OF MATRIX COMPLETION AND ROBUST PCA 

The number of applications in which these techniques are involved is ever increasing and their extensive 
presentation is beyond the scope of this book. Next, some key applications are selectively discussed 
in order to reveal the potential of these methods and at the same time to assist the reader in better 
understanding the underlying notions. 

Matrix Completion 

A typical application where the matrix completion problem arises is in the collaborative filtering task
(e.g., [167]), which is essential for building up successful recommender systems. Let us consider that
a group of individuals provide their ratings concerning products that they have enjoyed. Then, a matrix
with ratings can be filled, where each row indexes a different individual and the columns index the
products. As a popular example, take the case where the products are different movies. Inevitably, the
associated matrix will be partially filled because it is not common that all customers have watched
all the movies and submitted ratings for all of them. Matrix completion comes to provide an answer,
potentially in the affirmative, to the following question: Can we predict the ratings that the users would
give to films that they have not seen yet? This is the task of a recommender system in order to encourage
users to watch movies, which are likely to be of their preference. The exact objective of competition
for the famous Netflix Prize (see footnote 10) was the development of such a recommender system.

The aforementioned problem provides a good opportunity to build up our intuition about the matrix
completion task. First, an individual's preferences or taste in movies are typically governed by a small
number of factors, such as gender, the actors that appear in a movie, and the continent of origin. As a result,
a matrix fully filled with ratings is expected to be low rank. Moreover, it is clear that each user needs
to have at least one movie rated in order to have any hope of filling out her/his ratings across all
movies. The same is true for each movie. This requirement complies with the second assumption
in Section 19.10.1, concerning uniqueness; that is, one needs to know at least one entry per row and
column. Finally, imagine a single user who rates movies with criteria that are completely different from
those used by the rest of the users. One could, for example, provide ratings at random or depending
on, let us say, the first letter of the movie title. The ratings of this particular user cannot be described in
terms of the singular vectors that model the ratings of the rest of the users. Accordingly, for such a case,
the rank of the matrix increases by one and the user's preferences will be described by an extra set of
left and right singular vectors. However, the corresponding left singular vector will comprise a single
nonzero component, at the place corresponding to the row dedicated to this user, and the right singular
vector will comprise her/his ratings normalized to unit norm. Such a scenario complies with the third
point concerning uniqueness in the matrix completion problem, as previously discussed. Unless all the
ratings of the specific user are known, the matrix cannot be fully completed.
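The effect of a single user with unrelated rating criteria on the rank can be checked numerically. In the sketch below, a synthetic ratings matrix is generated from a small number of latent factors, and the row of one user is then replaced by random values; the sizes, the Gaussian model, and the function name are illustrative assumptions, not data from an actual recommender system.

```python
import numpy as np

def rank_with_outlier_user(n_users=20, n_movies=15, n_factors=2, seed=0):
    """Return the rank of a synthetic ratings matrix before and after one user rates at random."""
    rng = np.random.default_rng(seed)
    # Ratings explained by a few latent factors: a matrix of rank n_factors
    ratings = rng.standard_normal((n_users, n_factors)) @ rng.standard_normal((n_factors, n_movies))
    rank_before = np.linalg.matrix_rank(ratings)
    # One user (row 0) now rates with criteria unrelated to the latent factors
    ratings_odd = ratings.copy()
    ratings_odd[0] = rng.standard_normal(n_movies)
    rank_after = np.linalg.matrix_rank(ratings_odd)
    return rank_before, rank_after
```

With the default arguments the rank goes from two to three, in line with the argument that such a user requires an extra pair of singular vectors.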

Other applications of matrix completion include system identification [128], recovering structure
from motion [46], multitask learning [10], and sensor network localization [136]. 

Robust PCA/PCP 

In the collaborative filtering task, robust PCA offers an extra attribute compared to matrix completion,
which can prove crucial in practice: users may even tamper with some of the
ratings without affecting the estimation of the low rank matrix. This seems to be the case whenever the
rating process involves many individuals in an environment that is not strictly controlled, because
some of them are occasionally expected to provide ratings in an ad hoc or even malicious manner.

One of the first applications of PCP was in video surveillance systems (e.g., [36]) and the main
idea behind it appeared to be popular and extendable to a number of computer vision applications.
Take the example of a camera recording a sequence of frames consisting of a merely static background
and a foreground with a few moving objects, for example, vehicles and/or individuals. A common
task in surveillance recording is to extract from the background the foreground, in order, for example,
to detect any activity or to proceed with further processing, such as face recognition. Suppose the
successive frames are converted to vectors in lexicographic order and are then placed as columns in
a matrix $M$. Due to the background, even though this may slightly vary due, for example, to changes
in illumination, successive columns are expected to be highly correlated. As a result, the background
contribution to the matrix $M$ can be modeled as an approximately low rank matrix $L$. On the other
hand, the objects in the foreground appear as “anomalies” and correspond to only a fraction of pixels
in each frame; that is, to a limited number of entries in each column of $M$. Moreover, due to the motion
of the foreground objects, the positions of these anomalies are likely to change from one column of $M$
to the next. Therefore, they can be modeled as a sparse matrix $S$.

FIGURE 19.21

Background-foreground separation via PCP. Panels: original scene, background (low rank), foreground (sparse).

10 http://www.netflixprize.com/.

Next, the above discussed philosophy is applied to a video acquired from a shopping mall surveillance
camera [129], with the corresponding PCP task being solved with a dedicated accelerated proximal
gradient algorithm [130]. The results are shown in Fig. 19.21. In particular, two randomly selected
frames are depicted together with the corresponding columns of the matrices L and S reshaped back to
pictures.
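
The low rank plus sparse split described above can be tried on a toy problem. The following is only a minimal sketch: it does not use the accelerated proximal gradient algorithm of [130], but a simpler alternating scheme (an augmented Lagrangian iteration that alternates singular value thresholding for L with entrywise soft thresholding for S). The function names, the default values of `lam` and `mu`, and the synthetic "video" are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(X, tau):
    """Entrywise shrinkage: proximal operator of tau times the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Shrink the singular values: proximal operator of tau times the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def pcp(M, lam=None, mu=None, n_iter=1000, tol=1e-7):
    """Split M into a low rank L plus a sparse S by alternating the two shrinkage steps."""
    rows, cols = M.shape
    # Default weights are common heuristic choices (assumptions, not taken from the text).
    lam = 1.0 / np.sqrt(max(rows, cols)) if lam is None else lam
    mu = rows * cols / (4.0 * np.abs(M).sum()) if mu is None else mu
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multipliers for the constraint M = L + S
    for _ in range(n_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S


# Toy "video": each column plays the role of a vectorized frame
# (rank-2 background plus a few large foreground entries).
rng = np.random.default_rng(0)
background = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
foreground = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
foreground[mask] = rng.choice([-5.0, 5.0], size=mask.sum())
M = background + foreground

L_hat, S_hat = pcp(M)
rel_err = np.linalg.norm(L_hat - background) / np.linalg.norm(background)
```

For real frames, each grayscale image would first be flattened into one column of M, and each column of `L_hat` and `S_hat` reshaped back to the image size for display.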


19.11 A CASE STUDY: FMRI DATA ANALYSIS 

In the brain, tasks involving action, perception, cognition, and so forth, are performed via the simul- 
taneous activation of a number of so-called functional brain networks (FBNs), which are engaged in 
proper interactions in order to effectively execute the task. Such networks are usually related to low- 
level brain functions and they are defined as a number of segregated specialized small brain regions, 
potentially distributed over the whole brain. For each FBN, the involved segregated brain regions de- 
fine the spatial map, which characterizes the specific FBN. Moreover, these brain regions, irrespective 
of their anatomical proximity or remoteness, exhibit strong functional connectivity, which is expressed 
as strong coherence in the activation time patterns of these regions. Examples of such functional brain
networks are the visual, sensorimotor, auditory, default-mode, dorsal attention, and executive control
networks [148].

Functional magnetic resonance imaging (fMRI) [131] is a powerful noninvasive tool for detecting
brain activity along time. Most commonly, it is based on blood oxygenation level dependent
(BOLD) contrast, which translates to detecting localized changes in the hemodynamic flow of
oxygenated blood in activated brain areas. This is achieved by exploiting the different magnetic properties
of oxygen-saturated versus oxygen-desaturated hemoglobin. The detected fMRI signal is recorded in
both the spatial (three-dimensional) and the temporal (one-dimensional) domain. The spatial domain is
segmented with a three-dimensional grid to elementary cubes of edge size 3-5 mm, which are named
voxels. Indicatively, a complete volume scan typically consists of 64 × 64 × 48 voxels and it is acquired
in one or two seconds [131]. Relying on adequate postprocessing, which effectively compensates for
possible time lags and other artifacts [131], it is fairly accurate to assume that each acquisition is
performed instantly. The, say, $l$ in total voxel values, corresponding to a single scan, are collected in a
flattened (row) one-dimensional vector, $x_n \in \mathbb{R}^l$. Considering $n = 1, 2, \ldots, N$ successive acquisitions,
the full amount of data is collected in a data matrix $X \in \mathbb{R}^{N \times l}$. Thus, each column, $i = 1, 2, \ldots, l$, of $X$
represents the evolution in time of the values of the corresponding $i$th voxel. Each row, $n = 1, 2, \ldots, N$,
corresponds to the activation pattern, at the corresponding time $n$, over all $l$ voxels.

The recorded voxel values result from the cumulative contribution of several FBNs, where each one
of them is activated following certain time patterns, depending on the tasks that the brain is performing.
The above can be mathematically modeled according to the following factorization of the data matrix:

$$X = \sum_{j=1}^{m} a_j z_j^T := AZ, \qquad (19.127)$$

where $z_j \in \mathbb{R}^l$ is a sparse vector of latent variables, representing the spatial map of the $j$th FBN having
nonzero values only in positions that correspond to brain regions associated with the specific FBN, and
$a_j \in \mathbb{R}^N$ represents the activation time course of the respective FBN. The model assumes that $m$ FBNs
have been activated. In order to understand better the previous model, take as an example the extreme
case where only one set of brain regions (one FBN) is activated. Then matrix $X$ is written as

$$X = a_1 z_1^T := \begin{bmatrix} a_1(1) \\ a_1(2) \\ \vdots \\ a_1(N) \end{bmatrix} \underbrace{[\,\cdots\ *\ *\ \cdots\ *\ \cdots\,]}_{l \text{ (voxels)}},$$

where * denotes a nonzero element (active voxel in the FBN) and the dots zero ones. Observe that
according to this model, all nonzero elements in the $n$th row of $X$ result from the nonzero elements of
$z_1$ multiplied by the same number, $a_1(n)$, $n = 1, 2, \ldots, N$. If now two FBNs are active, the model for
the data matrix becomes

$$X = a_1 z_1^T + a_2 z_2^T = [a_1, a_2] \begin{bmatrix} z_1^T \\ z_2^T \end{bmatrix}.$$

Obviously, for $m$ FBNs, Eq. (19.127) results.

One of the major goals of fMRI analysis is to detect, study, and characterize the different FBNs and
to relate them to particular mental and physical activities. In order to achieve this, the subject (person)
subjected to fMRI is presented with carefully designed experimental procedures, so that the activation
of the FBNs will be as controlled as possible.
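
The factorization in Eq. (19.127) can be checked numerically on a toy example. The sketch below builds a small data matrix from two hypothetical spatial maps and two random time courses; the sizes, the positions of the active voxels, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, l, m = 6, 12, 2  # time points, voxels, FBNs (toy sizes, chosen for illustration)

# Sparse spatial maps: row j of Z is z_j^T, nonzero only on the voxels of the j-th FBN.
Z = np.zeros((m, l))
Z[0, [1, 2, 7]] = [1.0, 0.8, 1.2]
Z[1, [4, 9]] = [0.9, 1.1]

# Activation time courses: column j of A is a_j.
A = rng.standard_normal((N, m))

# Eq. (19.127): the matrix product equals the sum of the m outer products a_j z_j^T.
X = A @ Z
X_sum = sum(np.outer(A[:, j], Z[j]) for j in range(m))

# One-FBN case: row n of a_1 z_1^T is the spatial map z_1 scaled by the single number a_1(n).
X1 = np.outer(A[:, 0], Z[0])
```

Voxels that belong to no FBN (for instance, column 0 here) give all-zero columns of X, and the rank of X cannot exceed the number of active FBNs.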

ICA has been successfully employed for fMRI unmixing, that is, for estimating matrices A and
Z above. If we consider each column of X to be a realization of a random vector x, the fMRI data
generation mechanism can be modeled to follow the classical ICA latent model, that is, x = As, where
the components of s are statistically independent and A is an unknown mixing matrix. The goal of ICA
is to recover the unmixing matrix, W, and Z. Matrix A is then obtained from W. The use of ICA in the
fMRI task could be justified by the following argument. Nonzero elements of Z in the same column
contribute to the formation of a single element of X, for each time instant n, and correspond to different
FBNs. Thus, they are assumed to correspond to statistically independent sources.

As a result of the application of ICA on X, one hopes that each row of the obtained matrix Z could
be associated with an FBN, that is, to a spatial activity map. Furthermore, the corresponding column
of A could represent the respective time activation pattern.
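
To make the unmixing idea concrete, the following sketch runs a small, from-scratch symmetric FastICA iteration (tanh nonlinearity) on synthetic data generated according to X = AZ. It is not the algorithm or the interface of the GIFT toolbox; the function names, the synthetic spatial maps and time courses, and all sizes are assumptions made for illustration. The voxels play the role of samples, so the recovered independent components are the rows of Z.

```python
import numpy as np


def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ W


def fast_ica(X, m, n_iter=200, tol=1e-8, seed=0):
    """Estimate A_hat (N x m) and Z_hat (m x l) such that the centered X equals A_hat Z_hat."""
    rng = np.random.default_rng(seed)
    l = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)       # center each row across the voxels
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    K = (U[:, :m] / s[:m]).T * np.sqrt(l)        # whitening matrix (m x N)
    Y = K @ Xc                                   # whitened data with identity sample covariance
    W = sym_decorrelate(rng.standard_normal((m, m)))
    for _ in range(n_iter):
        G = np.tanh(W @ Y)
        # Fixed-point update for every row w: E[y g(w^T y)] - E[g'(w^T y)] w
        W_new = (G @ Y.T) / l - np.diag((1.0 - G**2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:
            break
    Z_hat = W @ Y                                # estimated spatial maps (rows)
    A_hat = np.linalg.pinv(W @ K)                # estimated time courses (columns)
    return A_hat, Z_hat


# Synthetic data: two sparse spatial maps and two different time courses.
rng = np.random.default_rng(0)
N, l, m = 40, 3000, 2
Z_true = np.zeros((m, l))
for j in range(m):
    active = rng.random(l) < 0.1
    Z_true[j, active] = rng.standard_normal(active.sum())
t = np.arange(N)
A_true = np.column_stack([np.sin(2 * np.pi * t / 20),
                          np.sign(np.sin(2 * np.pi * t / 13 + 0.5))])
X = A_true @ Z_true

A_hat, Z_hat = fast_ica(X, m)
# Absolute correlation between every true map and every estimated map.
corr = np.abs(np.corrcoef(Z_true, Z_hat)[:m, m:])
```

ICA recovers the components only up to ordering, sign, and scale, which is why the comparison uses absolute correlations rather than the raw values.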

This approach will be applied next for the case of the following experimental procedure [61]. A visual
pattern was presented to the subject, in which an 8-Hz reversing black and white checkerboard
was shown intermittently in the left and right visual fields for 30 s at a time. This is a typical block
design paradigm in fMRI, consisting of three different conditions to which the subject is exposed, i.e.,
checkerboard on the left (red block), checkerboard on the right (black block), and no visual stimulus
(white block). The subject was instructed to focus on the cross at the center during the full time of the
experiment (Fig. 19.22). More details about the scanning procedure and the preprocessing of the data
can be found in [61]. The Group ICA of fMRI Toolbox (GIFT)¹¹ simulation tool was used.

When ICA is performed on the obtained data set, the aforementioned matrices Z and A are
computed. Ideally, at least some of the rows of Z should constitute spatial maps of the true FBNs, and
the corresponding columns of A should represent the activation patterns of the respective FBNs, which
correspond to the specific experimentation procedure.

FIGURE 19.22

The fMRI experimental procedure used.

FIGURE 19.23

The two time courses follow well the fMRI experimental setup used.

11 It can be obtained from http://mialab.mrn.org/software/gift/.

12 The specific data set is provided as test data with GIFT.

The news is good, as shown in Fig. 19.23. In particular, Figs. 19.23A and B show two time courses 
(columns of A). For each one of them, one section of the associated spatial map (corresponding row 
of Z) is considered, which represents voxels of a slice in the brain. The areas that are activated (red) 
are those corresponding to the (a) left and (b) right visual cortex, which are the regions of the brain 
responsible for processing visual information. Activation of this part should be expected according to 
the characteristics of the specific experimental procedure. More interestingly, as seen in Fig. 19.23C,
the two activation patterns, as represented by the two time courses, follow closely the two different
conditions, namely, the checkerboard being placed to the left or to the right of the point on which the
subject is focusing.

Besides ICA, alternative methods discussed in this chapter can also be used, which can better
exploit the low rank nature of X. Dictionary learning is a promising candidate leading to notably good
results (see, for example, [11,116,140,181]). In contrast to ICA, dictionary learning techniques build
upon the sparse nature of the spatial maps, as already pointed out. Besides the matrix factorization
techniques, tensor-based methods have also been used, via related low rank tensor factorizations
that can exploit the 3D structure of the brain; see, e.g., [45] and the references therein.


PROBLEMS 

19.1 Show that the second principal component in PCA is given as the eigenvector corresponding to 
the second largest eigenvalue. 

19.2 Show that the pair of directions, associated with CCA, which maximize the respective correlation
coefficient, satisfy the following pair of relations:

$$\Sigma_{xy} u_y = \lambda \Sigma_{xx} u_x,$$

$$\Sigma_{yx} u_x = \lambda \Sigma_{yy} u_y.$$

19.3 Establish the arguments that verify the convergence of the k-SVD.

19.4 Prove that Eqs. (19.83) and (19.89) are the same.

19.5 Show that the ML PPCA tends to PCA as $\sigma^2 \to 0$.

19.6 Show Eqs. (19.91) and (19.92).

19.7 Show Eq. (19.102).

19.8 Show that the number of degrees of freedom of a rank $r$ matrix is equal to $r(l_1 + l_2) - r^2$.

MATLAB® EXERCISES 

19.9 This exercise reproduces the results of Example 19.1. Download the faces from this book’s
website and read them one by one using the imread.m MATLAB® function. Store them as
columns in a matrix X. Then, compute and subtract the mean in order for the rows to become
zero mean.

A direct way to compute the eigenvectors (eigenfaces) would be to use the svd.m MATLAB®
function in order to perform an SVD on the matrix X, that is, $X = UDV^T$. In this way, the
eigenfaces are the columns of U. However, this needs a lot of computational effort because
X has too many rows. Alternatively, you can proceed as follows. First compute the product
$A = X^T X$ and then the SVD of A (using the svd.m MATLAB® function) in order to compute
the right singular vectors of X via $A = V D^2 V^T$. Then calculate each eigenface according to
$u_i = \frac{1}{\sigma_i} X v_i$, where $\sigma_i$ is the $i$th singular value of the SVD. In the sequel, select one face at
random in order to reconstruct it using the first 5, 30, 100, and 600 eigenvectors.
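
The shortcut through the small matrix $X^T X$ can be sketched as follows. The exercise itself is stated for MATLAB®; this is only an illustrative NumPy version, with random numbers standing in for the face images. It uses an eigendecomposition of the symmetric matrix $X^T X$ in place of its SVD (the two coincide for such a matrix), and the function names and sizes are assumptions.

```python
import numpy as np


def eigenfaces_via_gram(X, k):
    """Top-k eigenfaces of the (pixels x faces) matrix X via the small faces x faces product."""
    mean_face = X.mean(axis=1, keepdims=True)
    Xc = X - mean_face                        # every row (pixel) becomes zero mean
    evals, V = np.linalg.eigh(Xc.T @ Xc)      # A = X^T X = V D^2 V^T
    order = np.argsort(evals)[::-1][:k]       # keep the k largest eigenvalues
    sigma = np.sqrt(evals[order])             # singular values of the centered X
    U = (Xc @ V[:, order]) / sigma            # u_i = X v_i / sigma_i
    return U, sigma, mean_face


def reconstruct(face, U, mean_face):
    """Project a face (column vector) onto the span of the eigenfaces in U."""
    return mean_face + U @ (U.T @ (face - mean_face))


rng = np.random.default_rng(0)
X = rng.standard_normal((400, 20))            # stand-in for 20 vectorized 20 x 20 images
U, sigma, mean_face = eigenfaces_via_gram(X, k=10)

face = X[:, [3]]                              # keep it as a column vector
errors = [np.linalg.norm(face - reconstruct(face, U[:, :k], mean_face))
          for k in (2, 5, 10)]
```

The reconstruction error can only shrink as more eigenfaces are used, because each larger set of eigenfaces spans a subspace that contains the smaller one.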

19.10 Recompute the eigenfaces as in Exercise 19.9, using all the face images apart from one, which
you choose. Then reconstruct the face that did not take part in the computation of the
eigenfaces, using the first 300 and 1000 eigenvectors. Is the reconstructed face anywhere close to the
true one?

19.11 Download the FastICA MATLAB® software package to reproduce the results of the cocktail
party example described in Section 19.5.5. The two voice signals and the music signal can be
downloaded from this book's website and read using the wavread.m MATLAB® function. Generate
a random mixing matrix A (3 × 3) and produce with it the three mixture signals. Each one of
them simulates the signal received in each microphone. Then, apply FastICA in order to
estimate the source signals. Use the MATLAB® function wavplay.m to listen to the original signals,
the mixtures, and the recovered ones. Repeat the previous steps performing PCA instead of ICA
and compare the results.

19.12 This exercise reproduces the dictionary learning-based denoising Example 19.5. The image
depicting the boat can be obtained from this book’s website. Moreover, k-SVD either needs to
be implemented according to Section 19.6 or an implementation available on the web can be
downloaded.

Then the next steps are to be followed. First, extract from the image all the possible sliding
patches of size 12 × 12 using the im2col.m MATLAB® function and store them as columns
in a matrix X. Using this matrix, train an overcomplete dictionary, $\Psi$, of size (144 × 196), for
100 k-SVD iterations with $T_0 = 5$. For the first iteration, the initial dictionary atoms are drawn
from a zero mean Gaussian distribution and then normalized to unit norm. As a next step,
denoise each image patch separately. In particular, assuming that $y_i$ is the $i$th patch reshaped as a
column vector, use the OMP (Section 10.2.1) in order to estimate a sparse vector $\theta_i \in \mathbb{R}^{196}$ with
$\|\theta_i\|_0 = 5$, such that $\|y_i - \Psi\theta_i\|$ is small. Then $\hat{y}_i = \Psi\theta_i$ is the $i$th denoised patch. Finally,
average the values of the overlapped patches to form the full denoised image.
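
The sparse coding step of this exercise can be sketched as follows. This is only an illustrative NumPy version of a basic OMP loop, not the book's implementation; a random unit-norm dictionary stands in for the trained one, and the function name `omp` and the synthetic test vector are assumptions.

```python
import numpy as np


def omp(Psi, y, T0):
    """Greedy T0-sparse approximation of y over the unit-norm columns (atoms) of Psi."""
    support = []
    coef = np.zeros(0)
    residual = y.copy()
    for _ in range(T0):
        scores = np.abs(Psi.T @ residual)     # correlation of every atom with the residual
        if support:
            scores[support] = 0.0             # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit of y on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(Psi[:, support], y, rcond=None)
        residual = y - Psi[:, support] @ coef
    theta = np.zeros(Psi.shape[1])
    theta[support] = coef
    return theta


rng = np.random.default_rng(0)
Psi = rng.standard_normal((144, 196))         # stand-in for a trained 144 x 196 dictionary
Psi /= np.linalg.norm(Psi, axis=0)            # normalize the atoms to unit norm

# A synthetic "patch" that is exactly 5-sparse over the dictionary.
true_support = rng.choice(196, size=5, replace=False)
theta_true = np.zeros(196)
theta_true[true_support] = rng.choice([-1.0, 1.0], size=5) * (1.0 + rng.random(5))
y = Psi @ theta_true

theta_hat = omp(Psi, y, T0=5)
y_hat = Psi @ theta_hat                       # the approximated (denoised) patch
```

In the exercise, `y` would be a noisy 12 × 12 patch reshaped to a 144-dimensional vector, so `y_hat` would approximate it rather than reproduce it exactly.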

19.13 Download one of the videos (they are provided in the form of a sequence of bitmap images) 
from http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html. 

In Section 19.10.3, the “shopping center” bitmap image sequence has been used. Read one
by one the bitmap images using the imread.m MATLAB® function, convert them from color to
grayscale using rgb2gray.m, and finally store them as columns in a matrix X. Download one of
the MATLAB® implementations of an algorithm performing the robust PCA task from
http://perception.csl.illinois.edu/matrix-rank/sample_code.html.

The “Accelerated Proximal Gradient” method and the accompanying proximal_gradient_rpca.m
MATLAB® function are a good and easy to use choice. Set λ = 0.01. Note, however,
that depending on the video used, this regularization parameter might need to be fine-tuned.

Leaky ReLU, 940
Learnability, 113
Learning 

algorithms, 180, 212 

Bayesian, 13, 15, 92, 596, 627, 630, 648, 681, 686, 
698, 715, 762, 1004 
Bayesian networks, 994 
curve, 189, 247 
decentralized, 227 

dictionary, 16, 439, 1069, 1072-1074, 1106 

graphical models, 859 

in RKHS, 579 

machines, 903 

methods, 3, 12, 13 

nonlinear models, 532 

online, 13, 14, 374, 384, 393, 396, 399, 402, 580, 
908 

parameter, 860 
phase, 10, 982 
probabilistic models, 992 
rate, 1027, 1028 
rule, 990 

task, 113, 380, 393, 402, 403, 581, 985, 988, 992, 
1014, 1041 
theory, 396, 400, 903 
Least modulus loss function, 554 
Least-mean-squares (LMS) 
performance, 204 
scheme, 206, 231, 404 
Least-squares (LS) 
classifiers, 78 
cost, 268 

estimator, 86, 89, 115, 116, 257-260, 293, 500
linear classifier, 345 
overdetermined case, 265 

solution, 89, 91, 102, 256, 258, 265, 279, 320, 327, 
463 

task, 254, 268, 283, 475, 476, 479, 504 
underdetermined case, 265 
LeNet-5, 972 

Levenberg-Marquardt method, 402 
Levinson’s algorithm, 153 


Linear 

classifier, 77, 109, 302, 310, 322, 323, 373, 390, 

538, 564, 566, 567, 569 
congruential generator, 734 
convergence, 274 
dichotomy, 536 

regression, 15, 71, 72, 89, 91, 92, 101, 104, 114,

116, 582, 596, 613, 640, 655, 660, 670, 722 
EM algorithm, 644 

model, 84, 101, 106, 118, 119, 243, 246, 254, 

286, 615, 634, 640, 644, 726 
modeling, 258 
subspace, 75, 440 
systems, 48 

time-invariant systems, 48 
unbiased estimator, 161, 258 
Linear ε-insensitive loss function, 555
Linear discriminant analysis (LDA), 312, 319 
Linear dynamic system (LDS), 844 
Linear programming (LP), 443 

Linearly constrained minimum variance beamforming, 
166 

Linearly separable, 537, 538, 910 
classes, 77, 373, 374, 390, 420, 537-539, 564, 572, 
903, 1023 

Liquid state machines, 984 

Local linear embedding (LLE), 16, 1041, 1091 

Local minimizer, 357, 418 

Log-loss function, 342 

Log-odds ratio, 341 

Log-partition function, 841 

Logistic regression, 317, 319, 321, 348, 399, 583, 637, 
639, 672, 673, 716, 725
Logistic regression loss function, 302 
Logistic regression model, 317, 681 
Long short-term memory network (LSTM), 980 
Loss function, 70-72, 74, 75, 77, 78, 80-82, 89, 112, 
122, 157, 180, 204, 253, 301, 302, 336, 

338, 339, 342, 355, 371, 372, 374, 388, 

390, 393, 394, 396-402, 405, 407, 420, 

482, 505, 548, 551, 554-556, 563, 572, 

581, 583, 913, 914, 1001, 1011, 1012, 

1019, 1025 
for regression, 344 
MSE, 244 
optimization, 548 
Loss matrix, 306

M

Machine learning, 2-4, 24, 45, 54, 70, 91, 109, 195, 
204, 208, 260, 315, 334, 369, 373, 380, 
384, 427, 428, 463, 492, 500, 574, 575, 
596, 620, 653, 681, 758, 771, 791, 856,
903, 982, 985, 1064 
optimization tasks, 61 
tasks, 14, 68, 254, 436, 732, 784 
Mahalanobis distance, 310 
Manifold learning, 459, 574, 1084, 1094 
for dimensionality reduction, 1094 
methods, 16 

Margin classification, 562 
Margin error, 563, 564, 570 
Marginal 

PDFs, 108, 1062 

probability, 23, 56, 57, 777, 803, 809, 828, 830, 865 
Markov blanket, 784 
Markov chain, 745 

Markov chain Monte Carlo (MCMC), 842 

Markov condition, 775 

Markov networks, 788 

Markov random field (MRF), 772, 788 

Matching pursuit, 474 

Matrix 

completion, 1097 
dictionary, 456 
error, 289 
noise, 296 

Max output unit, 940 
Max-product algorithm, 810 
Max-sum algorithm, 811 
Maximization 
step, 853 
view, 623 

Maximum a posteriori (MAP), 303 
probability estimation, 107 
Maximum entropy (ME) method, 61, 633 
Maximum likelihood (ML), 68, 99, 303, 597 
Type II, 611 

Maximum margin classifiers, 564 
Maximum margin parameter estimation, 862 
Maximum variance, 1045, 1057, 1068 
Maximum variance unfolding, 1094 
Mean field approximation, 649 
Mean field equations, 837 
Mean value, 25 


Mean-square error (MSE) 
cost, 158 

cost function, 244 
estimation, 94 
performance, 92 
prediction errors, 174 
Measurement noise, 167, 1085 
Memoryless nonlinear system, 534, 535
Memoryless nonlinearity, 534 
Message passing schemes, 15, 484, 485, 772, 799, 
802, 804, 807-811, 821, 822, 825, 
827-829, 837-840, 842, 847, 849, 850 
Metric distance function, 387 
Metropolis-Hastings algorithm, 754 
Minibatch schemes, 923 

Minimization, 319, 321, 340, 346, 394, 395, 407, 415, 
476, 486, 499, 515, 523, 524, 549, 551,
552, 833, 837, 1063, 1072, 1089 
problem in sparse modeling, 1101 
proximal, 409, 410 
solvers, 1070 

task, 89, 208, 275, 282, 411, 443, 452, 468, 477, 
487, 497, 498, 513, 515, 524, 550, 560, 
570, 575 

Minimum 

MSE performance, 116 
norm error, 255 
variance, 83, 86, 166 
estimator, 554 

unbiased linear estimator, 161 
Minimum description length (MDL), 611, 864 
Minimum enclosing ball (MEB), 576 
Minimum variance distortionless response (MVDR), 
166 

Minimum variance unbiased estimator (MVUE), 82, 
259 

Minimum variance unbiased (MVU), 82, 427 
Minmax optimization task, 998 
Minorize-maximization methods, 625 
Mirror descent algorithms (MDA), 415 
Misadjustment, 202 
Misclassification error, 303, 339, 562 
Mixture modeling, 596, 616, 648, 664, 698, 699, 702, 
703, 705, 845, 847 

Gaussian, 15, 619-621, 635, 636, 661, 662, 698 
Mixture of experts, 634 
Mixture of factor analyzers, 1084 
Mixture probabilities, 702, 704 





Model complexity, 96, 110, 111, 664, 864 
Modulated wideband converter (MWC), 461 
Moment matching, 688 
Moment projection, 687 
Momentum term, 925 
Moore-Penrose pseudoinverse, 256 
Moral graphs, 798 
Moralization, 798 
Moreau envelope, 405, 485 
Moreau-Yosida regularization, 405 
Moving average model, 54 
Multiclass classification schemes, 575 
Multiclass logistic regression, 346, 638 
Multidimensional scaling, 1048 
Multinomial distribution, 31 
Multinomial probability distribution, 661 
Multinoulli distribution, 699
Multipath channels, 436 
Multiple additive regression tree (MART), 344 
Multiple kernel learning (MKL), 580 
learning task, 581

Multiple measurement vector (MMV), 670 
Multiple-cause networks, 786 
Multitask learning, 15, 572, 1014, 1102
Multivariate Gaussian distribution, 34, 36, 63, 102,
119, 630, 631, 643, 726
Multivariate linear regression (MLR), 1057 
Mutual coherence of a matrix, 449 
Mutual information, 56 

N 

Naive Bayes classifier, 315, 347, 582 
Natural gradient, 402, 1064 
Natural language processing (NLP), 7, 1017 
Near-to-Toeplitz matrices, 280 
Nearest neighbor classifier, 315 
Negative gradient, 182 
Negative gradient direction, 183 
Negentropy, 1065 
NESTA, 485 

Nesterov’s momentum algorithm, 927 
Network in network, 966
Neural machine translation (NMT), 7, 15, 1017 
Neural network (NN), 3, 908 
feed-forward, 908 

Neural Turing machine (NTM), 984 
Newton’s iterative minimization method, 271
No free lunch theorem, 334 


Noise 

cancelation, 141, 143, 175 
canceler, 144, 145 
covariance matrix, 166 
disturbance, 596 
field, 138 

Gaussian, 86, 92, 175, 176, 259, 464, 615, 635, 645, 
665, 670, 716, 719, 727
matrix, 296 
outliers, 554 
precision, 645 
process, 143 
realization, 1074 

variance, 95, 119, 175, 208, 218, 224, 293, 421, 

497, 600, 603, 637, 640, 643, 644, 660, 
685, 718, 1084
white, 396 

Noisy observations, 242, 895 
Noisy-OR model, 772 
Nonconvex cost functions, 916 
Nonconvex optimization method, 855 
Nonconvex optimization task, 610, 999 
Nondifferentiable loss functions, 180, 481 
Nonexpansive operators, 363 
Nonlinear 
classifier, 537 
estimation, 196 
estimator, 245 
extensions, 395 
function, 122, 532, 539, 551 
mappings, 508 
modeling tasks, 532, 533 
regression, 710 
regression task, 552 
system, 533, 535 
thresholding function, 480 
vector functions, 170 
Nonlinearly separable classes, 77, 373 
Nonlinearly separable task, 910 
Nonnegative garrote thresholding rule, 435 
Nonnegative matrix factorization (NMF), 1040, 1075 
Nonnegatively truncated Gaussian prior, 719
Nonparallel hyperplanes, 440 
Nonparametric classifier, 315 
Nonparametric estimation, 114 
Nonparametric modeling, 551 
Nonseparable classes, 569 
Nonsmooth regularization, 14
Nonsparse representations, 512
Nonwhite Gaussian noise, 101, 106
Normal distribution, 32
Normal equations, 126 
Normal factor graph (NFG), 796 
Normalized LMS (NLMS), 211, 212, 218, 220, 247, 
283-285, 296, 378-380, 421 
set, 296, 421 
Nuclear norm, 1099 

Nuclear norm minimization, 1099-1101 
Numeric optimization schemes, 344 
Numerical errors, 170, 271 
Numerical performance, 271 
Nyström approximation of the kernel matrix, 579

O


Observations 
drawn, 231 
equation, 878 

sequence, 195, 377, 762, 894 
set, 455 
space, 453 
subspace, 459, 460
transmitted sequence, 851 
vector, 255, 521 
Occam factor, 607 
One pixel camera, 458 
Online 

algorithms, 180, 199, 203, 205, 254, 268, 284, 393, 
395-397, 399, 401, 403-405, 414, 420, 
474, 499, 500, 503, 509, 510
convergence properties, 396 
in RKHSs, 579 
performance, 394 

performance for convex optimization, 352 
EM algorithm, 627 
formulation, 394 

learning, 13, 14, 374, 384, 393, 396, 399, 580, 908
for convex optimization, 393 
in RKHS, 579 
optimization rationale, 914 
performance, 227 
performance schemes, 399 
processing, 369, 374, 500
processing rationale, 181
rationale, 504 

sparsity promoting algorithms, 499 
stochastic gradient, 224, 404 


Optical character recognition (OCR), 9, 574, 856 
Optimal 

classifier, 316, 562 
MSE backward errors, 157 
proposal distributions, 888 
sparse solutions, 449, 477
Optimal brain damage, 941 
Optimal brain surgeon, 942 
Optimization 

algorithms, 415, 625, 676, 840, 914, 1021 

convex, 352, 474, 483, 485, 503 

criteria, 308, 681 

decentralized, 417 

error, 402 

joint, 577 

loss function, 548 

parameter, 999 

path, 575 

problems, 637, 686, 794 
procedure, 502, 580 
process, 339, 341, 602, 831, 859 
rationale, 997 

schemes, 254, 394, 412, 862 
sparsity promoting, 498 

task, 89, 98, 158, 208, 254, 265, 275, 287, 289, 336, 
338, 342, 358, 391, 410, 441, 457, 478, 
486, 493, 556, 557, 561, 564, 606, 633, 
634, 793, 985, 1006, 1042, 1043, 1054, 
1070 

techniques, 13, 371, 655, 833 
theory, 482, 483 
Optimized performance, 384 
Optimized values, 662 
Optimizing cost, 842 

Optimizing nonsmooth convex cost functions, 384 
Oracle properties, 501 

Ordinary binary classification tree (OBCT), 329 
Orthogonal matching pursuit (OMP), 475 
Orthogonal projection matrix, 208 
Orthogonality condition, 124 
Overcomplete dictionary, 439, 441, 442, 450, 464, 
511, 512, 515, 1041, 1069, 1074, 1082,
1108 

Overfitting, 68, 91, 92, 95, 108, 265, 268, 321, 336, 
342, 549, 597, 599, 606, 913, 1015 
performance, 539 
Overlapping classes, 78, 569, 572

P

Pairwise MRFs, 793 
Parameter 

error vector convergence, 188 
learning, 860
optimization, 999 

regularization, 402, 435, 501, 502, 552
sparse vector, 445 

vector, 74, 83, 87, 89-91, 104, 107, 196, 204, 222, 
225, 237, 240, 397, 403, 407, 431, 446, 
474, 475, 478, 479, 499, 523
Partial least-squares, 1056 
Particle filtering, 878, 881 
Partition function, 789 
Path 

history, 883 
iterations, 186 
optimization, 575 
PDFs 

conditional, 105, 303, 315, 317, 598, 767, 771, 866 
marginal, 108, 1062 
PEGASOS algorithm, 395 
Perceptron algorithm, 391 
Perceptron cost, 904 
Perceptron cost function, 904 
Perceptron learning rule, 903 
Perfect elimination sequence, 824 
Perfect maps, 787 

Perfect periodic sequence (PPSEQ), 218 
Performance 

analysis, 205, 212, 237 

bounds, 476, 480, 484, 567 

convergence, 212, 219, 379 

error, 222 

gains, 580 

index, 110, 204 

loss, 109, 337, 401, 402

measures, 487, 500, 501 

MSE, 92 

online, 227

online algorithms, 394 
overfitting, 539 
properties, 198

Phase transition curve, 487, 489 
Pitman-Yor IBP, 710 
Point spread function (PSF), 175 
Poisson process, 762 
Pólya urn model, 695


Polygonal path, 478, 479 
Polytrees, 804 
Pooling, 963 

Positive definite function, 46 
Posterior 

approximation, 1005 
class probabilities, 862 
densities, 883 

distribution, 119, 612, 627, 698, 700 

DP, 694, 695, 725 

DP formulation, 695 

error covariance matrices, 174 

estimate, 652 

estimator, 168 

Gaussian, 106, 613, 614 

Gaussian distribution, 643 

Gaussian process, 716 

joint, 700 

mean, 602, 716, 727, 881
mean values, 670 
prediction, 716 
prediction variance, 716 

probability, 79, 302, 316, 321, 335, 336, 617, 621, 
622, 637, 673, 691, 694, 724, 797, 862 
probability distributions, 648 
Potential functions, 789 
Potts model, 793 

Power spectral density (PSD), 47, 577 
Prediction error, 157 
Prediction performance, 75, 333 
Primal estimated subgradient solver, 395 
Principal component, 1047 

Principal component analysis (PCA), 16, 1040, 1041 
Principal component pursuit (PCP), 1100 
Probabilistic 

graphical models, 690, 771, 772, 821 
PCA, 1078 
view, 878 
Probability 

discrete, 696, 774, 827 

distribution, 54, 68, 99, 625, 627, 628, 653, 693, 
699, 746, 747, 752, 762, 783, 787, 789, 
791, 795, 816-818, 824, 825, 827, 831, 
835, 841, 845, 988, 995-998, 1005 
error, 77, 78, 111, 302, 304, 308, 332, 345, 347, 
855, 862 

for discrete variables, 26 
function, 998
function space, 999

joint, 22, 23, 777, 778, 793, 797, 800, 813, 816, 
828, 831, 836, 840, 847, 852, 993 
laws, 779 

mass, 640, 667, 693, 694, 697, 1001 
posterior, 79, 302, 316, 321, 335, 336, 617, 621, 
622, 637, 673, 691, 694, 724, 797, 862 
sequence, 708, 709, 725
space, 56 

theory, 20-22, 36, 54, 107, 609, 735 
transition matrix, 750 

Probability density function (PDF), 20, 24, 303, 554, 
693, 775 

Probability mass function (PMF), 22, 628 
Probit regression, 322 
Process noise, 167, 887 
Product rule of probability, 23 
Projected gradient method (PGM), 391 
Projected Landweber method, 391 
Projected subgradient method, 392 
Projection approximation subspace tracking (PAST), 
1052 

Projection matrix, 209 

Projections onto convex sets (POCS), 14, 357, 365 
schemes, 371 
Property sets, 374 
Proportionate NLMS, 212 
Proposal distribution, 741, 742, 754, 755, 872, 875, 
882, 884, 885 

Proximal 

gradient, 412-414
algorithms, 412 
mapping, 485 
minimization, 409, 410 
operator, 405-414, 420, 485 
point algorithm, 410 

Proximal forward-backward splitting operator, 414 
Pruning scheme, 348 
Pseudocovariance matrix, 131 
Pseudorandom generator, 735 
Pseudorandom number generator, 733 

Q 

Quadratic ε-insensitive loss function, 555
Quadratic discriminant analysis (QDA), 312 
Quasistationary rationale, 859 
Quaternary classification, 576

R


Random 

components, 1047 
distribution, 698 
draw, 695 
elements, 664 
events, 56 
experiment, 41 
Gaussian, 348, 716 
matrix, 451, 453, 454, 457, 463 
measure, 872, 893 
nature, 29, 595, 661 
number generation, 733 
parameters, 595, 606, 613 
process, 20, 41, 44, 62, 134-137, 141, 149, 157, 
163, 215, 226, 284, 693, 710-712, 716, 
1065 

projection, 459 
projection matrices, 459 
sampling, 735 

sequence, 62, 63, 247, 733, 745 
variables, 20, 22, 23, 68, 71, 74, 80, 94, 124, 125, 
131, 134, 162, 185, 199, 200, 225, 244, 
257, 258, 275, 449, 595-597, 599, 606, 
610, 648, 656, 664, 667, 690, 693, 699, 
700, 732, 734-736, 738, 740, 752, 772, 
774-776, 778, 781, 784, 842, 843, 847, 
860, 865, 871, 1041, 1047, 1053 
convergence, 62 

vector, 26-28, 39, 49, 95, 102, 103, 128, 144, 149, 
154, 160, 194, 236, 500, 501, 596, 612, 
613, 628, 642, 699, 710, 723, 732, 739, 
741, 759, 761, 871, 875, 1041, 1046, 1053 
variable, 743 

walk, 271, 751-753, 756, 767 
walk model, 884 

Random demodulator (RD), 458, 460 
Random fields, 138 
Random forests, 333 
Random Fourier feature (RFF), 532, 577 
rationale, 580 

Random Markov fields, 788 
Random sign ensemble (RSE), 491 
Random signals, 41 
Randomized 
algorithms, 459 
pruning, 873
Randomness, 20, 59, 60, 62, 257, 456, 598, 627, 633,
734 

Randomness tests, 734 
Rao-Blackwell theorem, 88 
Rao-Blackwellization, 893 
Rayleigh’s ratio, 326 
Rectified linear unit, 939 
Recurrent neural network (RNN), 15, 976 
Recursive least-square (RLS) 
convergence, 379 
scheme, 254, 275 

Recursive least-squares algorithm (RLS), 268 
Recursive scheme, 151, 197, 281, 292 
Reduced convex hull (RCH), 572 
Regression, 11, 13, 68, 71, 77, 93, 94, 99, 117, 301, 
317, 329, 344, 369, 374, 554, 575, 596, 

598, 599, 606, 648, 672, 673, 712 
Bayesian, 712, 714 

linear, 15, 71, 72, 89, 91, 92, 101, 104, 114, 116,
582, 596, 613, 640, 655, 660, 670, 722 
model, 72, 86, 89, 96, 98, 115, 197, 200, 205, 231, 
232, 239, 246, 283, 296, 369, 383, 420, 

492, 505, 600, 611, 613, 666, 672, 682, 683
nonlinear, 710 

task, 9, 11, 71, 77, 95, 111, 113, 301, 316, 333, 342,
382, 431, 434, 532, 536, 555, 589, 605,
611, 613, 640, 665, 714, 716
task random parameters, 606 
trees, 329, 333, 344 
Regret analysis, 396 

Regularization, 13-15, 68, 89, 91-93, 260, 268, 380, 
395, 401, 410, 427, 431, 474, 493, 539, 
549-551, 560, 572, 940 
method, 430, 431 

parameter, 92, 402, 435, 501, 502, 552 
penalty, 497 
schemes, 342 
sparsity promoting, 509 
Regularized minimization, 405, 548 
Reinforcement learning (RL), 11, 12 
techniques, 984 
Reject option, 306 
Rejection probability, 755 
Rejection sampling, 739 
Rejection scheme, 742 
Relative entropy, 61, 936 
Relaxed projection operator, 364 


Relevance vector machine (RVM), 15 
Reparameterization trick, 1007 
Reproducing kernel Hilbert space (RKHS), 114, 532,
539, 540, 542, 545, 550, 551, 672, 710, 
711, 714, 1085, 1087 
Reproducing property, 539 
Resampling methods, 873, 875 
Residual networks, 973 
Resolvent of the subdifferential mapping, 411 
Restricted Boltzmann machine (RBM), 988 
Restricted isometry property (RIP), 452 
Ridge regression, 13, 14, 89-92, 265, 267, 428,
432-435, 496, 497, 551-553, 560, 599,
601, 665, 714
Ring networks, 230 
Risk classification, 306 
Risk minimization, 347, 574 
Robbins-Monro algorithm, 194 
Robustness against overfitting, 609 
Root square estimation consistence, 500 
Rotation invariance, 1092 
Running intersection property, 824 

S 

Sample mean, 43 
SCAD thresholding rule, 435 
Scale invariance, 1072, 1092 
Scatter matrices, 324 
Schur algorithm, 157 

Scoring-based structure learning methods, 864 
Seismic signal processing, 143 
Semisupervised learning, 11, 12, 222, 992 
Sensing matrix, 446, 449, 453-458, 462, 463, 468, 
474, 477, 479, 486, 488, 490, 505, 
511-513, 523-525, 1085, 1096 
sparse, 524 
Separator, 826 

Sequential importance sampling (SIS), 881 

Sequential minimal optimization (SMO), 576 

Set membership algorithms, 379 

Sigmoid link function, 318 

Sigmoidal Bayesian networks, 785 

Sigmoidal networks, 993 

Signal 

processing, 2, 6, 13-16, 41, 45, 48, 134, 140, 180,
208, 352, 380, 384, 412, 414, 428, 454,
461, 463, 842, 850
subspace, 1052 
Singleton, 385, 388, 411
Singular value, 261 

Singular value decomposition (SVD), 254, 260 
Singular value thresholding (SVT), 1101 
Singular vectors, 261 
Slack variables, 556 
Smoothing, 893 

Smoothly clipped absolute deviation (SCAD), 435, 
503 

Snapshot ensembling, 930 
Soft thresholding, 434 
Soundness, 787 
SpAPSM, 505, 510, 524 
Spark of a matrix, 447 
SpaRSA, 483 

Sparse, 380, 404, 428, 431, 436, 438, 439, 445, 446, 
448, 453, 457, 458, 490, 494, 499, 501, 
510, 513, 520, 560, 581, 1069, 1088, 1092, 
1097, 1098 
analysis, 513, 515 
analysis representation, 511 
coding, 1070 

coding representation vectors, 1071 
coding scheme, 1072 
DCT transforms, 465 
DFT, 460 

factor analysis, 1082 
Gaussian processes, 715 
learning, 510 

linear regression modeling, 653 
matrix, 577, 1089, 1093, 1100, 1101, 1103 
MKL variants, 581 

modeling, 14, 428, 473, 510, 516, 1040, 1096 
multitone signals, 460 
parameter vector, 447, 475 
regression modeling, 717 
representation, 447, 456, 459, 463, 512, 1069 
sensing matrix, 524 
signal, 454, 474 
signal recovery, 455, 487 
signal representation, 436 
solution, 442, 451, 455, 468, 494, 497, 500, 502, 
523, 581, 1071 
structure, 494, 1071 
synthesis, 513 
synthesis modeling, 516 
target vector, 479 
wavelet representation, 509, 510 


Sparse Bayesian learning (SBL), 15, 667, 669 
Sparsely represented signal, 458 
Sparsest solution, 440, 442, 448, 450-452, 477, 478, 
515 

Sparsity, 436, 437, 442, 451, 461, 463, 489, 492, 497,
500, 513, 532, 560, 667, 669, 671, 672, 
685, 703, 1069, 1073 
assumption, 510 

constraint, 440, 456, 480, 672, 719, 1073 
in multipath, 437 

level, 445, 449, 452-454, 476, 479, 480, 484, 487,
489, 507, 509, 510, 515, 1071, 1072, 1098 
promoting 

algorithm, 381, 474, 509 
online algorithms, 212
optimization, 498 
priors, 1082 

quantifying performance, 487 
regularization, 509 
regularizers, 1076 
structure, 454 

Spectral unmixing (SU), 718 

Spike and slab prior, 671 

Split Levinson algorithm, 154 

Square root estimation consistency, 501 

Stable embeddings, 459 

Standard Gaussian, 696 

Standard regression model, 500 

State space models, 843 

Stationary covariance function, 712 

Stationary Gaussian process, 712 

Statistical independence, 23 

Steady state, 203 

Steering vector, 164 

Stick breaking construction, 697 

Stick breaking representation of a DP, 697 

Stochastic 

approximation, 194 
convergence, 61 
gradient, 240, 275 
descent, 196, 404 
descent schemes, 214 
descent algorithm, 198
online versions, 404
rationale, 181, 241, 403 
scheme, 197, 232, 404 
matrix, 235 
Stochastic matrices
left, right and doubly, 233 
Stochastic processes, 41 
Stochastic volatility models, 894 
String kernels, 548 
Strongly attracting mapping, 365 
Strongly convex function, 399 
Student’s t distribution, 666 
Sub-Gaussian PDFs, 1061 
Sub-Nyquist sampling, 460 
Subclassifiers, 333 
Subdifferential, 385 

Subgradient, 352, 385, 386, 388-390, 393-395, 398, 
406, 415, 416, 419, 905
algorithm, 388, 390, 391, 393, 398, 399, 420
method, 388, 390 
Subjective priors, 610 
Sublinear convergence rate, 482 
Suboptimal classifier, 315 
Subspace 

dimensionality, 440, 466 
estimation, 1052 
learning methods, 171 
linear, 75, 440
observations, 459, 460 
signal, 1052 
tracking, 1052 
tracking problem, 1052 
Subspace pursuit (SP), 480 
Sufficient statistic, 87 
Sum rule for probabilities, 22 
Super-Gaussian PDFs, 1061 
Supervised classification, 113 
Supervised learning, 8, 9, 11, 12, 80, 988, 991, 992 
Supervised PCA, 1049 
Support consistency, 501 
Support vector, 556, 569
Support vector machine (SVM), 395, 532, 672 
classification, 572 
classification task, 575 
classifiers, 576, 589 
kernel classifier, 587 
Support vector regression (SVR), 672 
Supporting hyperplane, 362 
Switching linear dynamic system (SLDS), 859 
System identification, 141 


T 

Tangent hyperplane, 355 
TensorFlow machine learning, 1026 
Test error, 96 
Text classification, 584 
Thresholding schemes, 490 
Time-delay neural networks, 976 
Time-frequency analysis, 516 
Toeplitz matrices, 150 
Tokens-tokenization, 1018 
Total least-squares (TLS) 
estimator, 293 
method, 287 
solution, 292, 295 
Total variation (TV), 497 
Training error, 96 
Transfer learning, 1013 

Transition probabilities, 746, 747, 749-751, 844-847, 
852, 854, 857 

Translation invariance, 1092 
Tree classifiers, 333 
Trees, 803 

Triangulated graphs, 822 

Two-stage thresholding (TST) scheme, 484 

Type I estimator, 602 

U 

Unbiased estimator, 64, 81-83, 161, 162, 174, 258, 
260, 313, 345, 744, 766, 872, 873, 897 
linear, 161, 258 

Unconstrained optimization, 485 
Uncorrelated noise, 597 
Uncorrelated random variables, 125
Undirected graphical models, 788, 799, 840 
Uniform distribution, 32 
Uniform random projection (URP), 491
Uniform spherical ensemble (USE), 491 
Univariate Gaussian, 643 
Universal approximation property, 947 
Unlearning contribution, 990 
Unmixing matrix, 1066 
Unobserved 

latent variables, 702 
random variables, 611
Unregularized LS, 89, 434, 501, 502 
Unscented Kalman filters, 170
Unsupervised learning, 11, 12, 988, 992, 1067 
V

Validation, 109, 946 
Value ratio (VR), 586 
Value similarity (VS), 585 
Variance, 25 

minimum, 83, 86, 166 
Variational 

approximation method, 681 
autoencoders, 1004, 1005 
EM algorithm, 623, 665, 720, 837, 1080 
message passing, 839 
parameters, 681 
posterior estimates, 710 
Vector regression, 672 
Vector space model (VSM), 584 
equivalent, 587 

Vertex component analysis (VCA), 720 
VGG-16, 972 
Viterbi algorithm, 851 
Viterbi reestimation, 854 
Volterra models, 534 


W 

Wake-sleep algorithm, 994 
Wasserstein distance, 1001 
Wasserstein probability, 1001 
Weak classifier, 337, 344, 349 
Weak convergence, 367 
Weight optimization, 576 
Weighting scheme, 495 
Welch bound, 449 
Well-posed problems, 91 
White noise, 51 

Widely linear estimator/filtering, 130 
Wiener models, 534 
Wiener-Hammerstein models, 534 
Wiener-Hopf equations, 126 
Wireless sensor network (WSN), 228 
Wirtinger calculus, 132 
Wishart PDFs, 662, 699 

Y 

Yule-Walker equations, 52 



